In this article, we’re going to learn how to estimate sentiment using what’s called a document term matrix or “DTM”.
What is a Document Term Matrix (DTM)?
Firstly, what exactly is a document term matrix?
Well, it’s a matrix. And specifically, it’s a matrix that represents the words that are inside a Corpus.
So we take our entire Corpus; we take all of the words that are inside our text corpus… and we transform it into a mathematical matrix, where each column represents a unique word in the entire text corpus.
Each column represents one unique word that exists in the entire Corpus across all text documents.
And each row represents a unique text document.
And it’s this particular structure; this particular format, that’s the reason that this thing is called a document term matrix.
Because if you think about a matrix, we typically define a mathematical matrix as m × n, where m refers to the number of rows and n refers to the number of columns.
When working with text data, you’ve got some number of documents, D, and some number of terms (or words), T.
So this is a D × T matrix – a Document Term matrix, or simply just a Document Term Matrix (DTM).
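To make this documents-by-terms structure concrete, here’s a minimal Python sketch that builds a DTM as plain frequency counts. The two-document corpus here is hypothetical toy data, purely for illustration:

```python
from collections import Counter

# A hypothetical toy corpus: each document is just a string.
corpus = {
    "d1": "happy happy good results",
    "d2": "bad results bad outlook",
}

# The vocabulary: every unique word across the whole corpus (the columns).
vocab = sorted({w for doc in corpus.values() for w in doc.split()})

# The DTM: one row per document, one column per unique word,
# each cell holding that word's frequency count in that document.
dtm = {}
for name, doc in corpus.items():
    counts = Counter(doc.split())
    dtm[name] = [counts.get(w, 0) for w in vocab]

print(vocab)      # → ['bad', 'good', 'happy', 'outlook', 'results']
print(dtm["d1"])  # → [0, 1, 2, 0, 1]
```

Notice how the column labels are shared across all rows – that shared vocabulary is exactly what turns a loose pile of documents into a single matrix.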
Document Term Matrix vs. Term Document Matrix
Some people call the DTM a Term Document Matrix. And that’s only because they put the words in the rows, and put the documents in the columns.
It’s just another way of storing that information.
But for the most part, people tend to call it a Document Term Matrix (DTM).
Why use a Document Term Matrix (DTM)?
While it’s possible to estimate sentiment without a Document Term Matrix, a DTM can be extremely useful when working with really large datasets – for instance, when text mining with “Big Data”.
Using a DTM creates a coherent structure for what is otherwise an “unstructured” format of data. This in turn allows for more efficient storage and processing.
The matrix representation of the Corpus allows for estimating sentiment significantly more efficiently, compared to, say, estimating sentiment iteratively, document by document.
What does a DTM look like?
Now let’s think about what this actually looks like.
Firstly, it’s useful to think of a Corpus as a bunch of documents.
If we see the Corpus object in that way, then we can think of each document within the Corpus as a bag of words or a list of words.
Consider a small “toy” corpus with just 6 documents relating to 3 firms over 2 time periods.
Here N represents the total number of firms, in this case, three.
And we’re looking at two specific time periods, t = 1 and t = 2.
So we’ve got a bunch of text documents, and each document has a list of words, or a bag of words. Because fundamentally, that’s exactly what a text document is. It’s just got a bunch of words.
This Article features a concept that is covered extensively in our course on Investment Analysis with Natural Language Processing (NLP).
If you’re interested in leveraging the power of text data for investment analysis, you should definitely check out the course.
Now, since we can think of every single document in this way, what if we just took all of the unique words and “plonked” them into a matrix?
Well, if we did that, we’d have something like this…
We’ve still got those same six text documents between three firms over two time periods.
But rather than looking at individual bags of words for each and every document, we can just get all of the unique words across all documents and place them as individual columns.
The columns represent unique words, which means, of course, each word only shows up one time.
And all of the words that show up come from the entire Corpus.
Thus, that really is the entire Corpus, just in unique terms.
Essentially, transforming / creating the DTM is akin to tokenisation, because we’ve literally got a bunch of tokens now across multiple columns.
So while previously we could only see our Corpus as either a bunch of files, or a bunch of lists of words…
We can now see the entire Corpus and look at all of the information that’s inside the Corpus.
And this whole thing here is then our document term matrix.
The values inside the Document Term Matrix represent the frequency counts of the individual words that show up in individual documents.
Put differently, each value represents a term count (or term frequency).
So if we focus our attention on the first row up there, then what it tells us is that for the first document (i.e., the document for firm 1 at time 1):
- the first word shows up 4 times
- the second and third words don’t show up at all
- the fourth word shows up once
- the fifth word shows up twice
- the final word in the corpus doesn’t show up
Similarly for the second row (i.e., the document for firm 2 at time 1), there’s no occurrence of the first word; five occurrences of the second; no occurrences of the third; three occurrences of the fourth, and so on and so forth. You get the idea.
We’re literally just counting the number of times a given term occurred, and calling that the term count, or term frequency, or frequency count – call it whatever you fancy.
And of course it’s the same principle and the same interpretation across all documents.
Importantly, notice that the vast majority of values inside the DTM are zeros. This is normal. And the bigger the DTM, the more zeros you’ll see.
That’s because by construction, the DTM is a sparse matrix, and the majority of the values will be 0.
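As a quick illustration of that sparsity, here’s a short Python sketch that measures the share of zero cells in a small DTM. The counts are hypothetical, loosely echoing the toy example above:

```python
# Two hypothetical DTM rows (documents), six columns (unique words).
dtm_rows = [
    [4, 0, 0, 1, 2, 0],   # document 1
    [0, 5, 0, 3, 0, 0],   # document 2
]

# Flatten all cells and compute the fraction that are zero.
cells = [v for row in dtm_rows for v in row]
sparsity = cells.count(0) / len(cells)
print(f"{sparsity:.0%} of cells are zero")  # → 58% of cells are zero
```

Even in this tiny example more than half the cells are zero; with thousands of documents and tens of thousands of unique words, the zero share typically climbs well above 99%, which is why real implementations store DTMs in sparse matrix formats.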
Now this particular Document Term Matrix has all of the unique words in the entire Corpus.
When working with text data, we don’t tend to work with all the words in the corpus; not even all the unique words.
Instead, we tend to only work with “cleaned words”.
Want to go further in Financial Sentiment Analysis?
Get the Investment Analysis with NLP Study Pack (for FREE!).
Document Term Matrix Subset (Cleaned Words)
The Document Term Matrix in its current form then, isn’t particularly useful for estimating sentiment.
But of course we can create a similar Document Term Matrix using only the cleaned words in the Corpus.
And that might look something like this:
So we’ve just got the same format.
We’ve got documents, only this time they’re cleaned documents, and we’ve got words, except they’re cleaned words.
The superscript c here denotes “cleaned”, so w^c represents a “cleaned word”, and d^c represents a “cleaned document”.
And this is now a Document Term Matrix of all of the cleaned words in the entire Corpus, across all documents.
The interpretation of this particular Document Term Matrix is of course, identical to the interpretation of the previous DTM.
By the way, in case you’re wondering how we actually clean text data, we’ve got an article on that here. Long story short though, text cleaning involves:
- Removing unwanted characters
- Harmonising letter case, and
- Removing stopwords (i.e., the most common words in the Corpus)
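Those three steps can be sketched in a few lines of Python. Note that the stopword list here is a tiny hypothetical one, purely for illustration – real stopword lists are much longer:

```python
import re

# A minimal, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "of"}

def clean(text):
    # 1. Remove unwanted characters (keep letters and whitespace only).
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 2. Harmonise letter case.
    text = text.lower()
    # 3. Remove stopwords (and single-letter leftovers from contractions).
    return [w for w in text.split() if w not in STOPWORDS and len(w) > 1]

print(clean("The firm's outlook is GOOD, and profits are amazing!"))
# → ['firm', 'outlook', 'good', 'profits', 'amazing']
```

The cleaned bag of words that comes out of a function like this is exactly what populates the cleaned-words DTM below.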
Estimating Sentiment using a DTM
Now, how exactly do we use a Document Term Matrix to estimate sentiment?
Given that we can use the same approach to create a Document Term Matrix of cleaned words, as we did for just all of the words in the entire Corpus…
We can of course create a Document Term Matrix which comprises only the words which belong to a specific sentiment language (aka sentiment dictionary, sentiment vocabulary).
And that will look something like this:
Now we’ve only got the cleaned words which belong to a specific sentiment language.
Of course, the interpretation of this “subset” DTM is identical to the previous versions.
The only difference is that rather than having all of the unique words in the entire Corpus – or indeed, all of the unique cleaned words that are inside the entire Corpus –
we now only have the cleaned words which belong to a specific sentiment language.
Now, let’s say for simplicity that we are only looking at positive sentiment.
And our positive sentiment lexicon / vocabulary, again for simplicity, just has three positive words inside it.
Then our DTM might look like this:
So we’ve got “happy”, “good”, and “amazing”, which are the only words in our sentiment language.
And given the interpretation of the document term matrix, and the frequency counts available, we can get an idea of which specific sentiment language words appear in which documents.
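Creating such a “subset” DTM amounts to keeping only the columns whose word appears in the sentiment lexicon. Here’s a minimal Python sketch, using hypothetical frequency counts:

```python
# A hypothetical cleaned-words DTM: column labels plus one row per document.
vocab = ["amazing", "bad", "good", "happy", "results"]
dtm = [
    [1, 0, 1, 2, 1],   # firm 1, time 1 (hypothetical counts)
    [1, 2, 0, 0, 1],   # firm 2, time 1 (hypothetical counts)
]

# The toy positive sentiment lexicon from the example above.
lexicon = {"happy", "good", "amazing"}

# Keep only the columns whose word is in the sentiment lexicon.
keep = [j for j, w in enumerate(vocab) if w in lexicon]
subset_dtm = [[row[j] for j in keep] for row in dtm]

print([vocab[j] for j in keep])  # → ['amazing', 'good', 'happy']
print(subset_dtm)                # → [[1, 1, 2], [1, 0, 0]]
```

Column selection like this is cheap even on very large DTMs, which is part of what makes the matrix representation so efficient for sentiment estimation.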
Now, why bother with all of this?
Because if you look closely, the sum across every single row here is nothing but our estimate for sentiment.
The sum of each row represents sentiment for that document, estimated using a frequency counts approach.
Because remember, the values inside this “subset” document term matrix are literally just the frequency counts of all of the cleaned words in a given document, which belong to a sentiment language.
So if we take a look at our toy example, again, we have the count for positive sentiment!
We’re just calling that “pos_count”.
And each value in the “pos_count” column essentially represents positive sentiment.
Strictly, in this case it represents positive sentiment, estimated using a frequency counts approach.
Of course, estimating sentiment using a proportional counts approach is trivial after having obtained the frequency counts based estimates.
Because all we’d need to do is divide every single row in the “pos_count” column by the total number of cleaned words for each document.
It literally is that simple!
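Both estimates can be sketched in a couple of lines of Python, again with hypothetical toy values:

```python
# A hypothetical "subset" DTM over a positive lexicon
# ["amazing", "good", "happy"]; one row per document.
subset_dtm = [
    [1, 1, 2],   # firm 1, time 1
    [1, 0, 0],   # firm 2, time 1
]

# Total number of cleaned words in each document (hypothetical).
doc_lengths = [10, 8]

# Frequency counts approach: sum each row of the subset DTM.
pos_count = [sum(row) for row in subset_dtm]

# Proportional counts approach: divide by each document's cleaned word count.
pos_prop = [c / n for c, n in zip(pos_count, doc_lengths)]

print(pos_count)  # → [4, 1]
print(pos_prop)   # → [0.4, 0.125]
```

The frequency counts version gives raw tallies, while the proportional version controls for document length – useful when comparing, say, a short press release against a lengthy annual report.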
Do you want to build a rigorous investment analysis system that leverages the power of text data with Python?