In this article, we’re going to learn about the two core approaches to estimating sentiment for sentiment analysis in finance.
First, a quick recap. Recall that sentiment analysis, at least in finance, involves quantifying and exploiting sentiment or emotions for some sort of investment purpose.
Put differently, it’s about estimating sentiment and then linking it to other firm characteristics to better understand how sentiment can drive firm performance.
Recall, too, that sentiment itself can include positivity, negativity, uncertainty, narcissism, and a whole host of other human emotions that come to mind when you think of the word sentiment.
Approaches of Sentiment Analysis in Finance
Now, when we think about estimating sentiment for sentiment analysis in finance, there are broadly two approaches you can take.
The first is what’s called the lexicon or dictionary based approach.
And the second is broadly some sort of machine learning approach.
This holds regardless of whether you’re looking to estimate sentiment for an individual firm or for the market as a whole.
Let’s consider both, starting with the lexicon or dictionary based approach.
Lexicon based approach for sentiment analysis in finance
The lexicon based approach is perhaps the most common way of estimating sentiment. You’ll see why later on in the post. For now though, let’s crack on with learning how it works.
Start with a Prior
For the lexicon based approach, we typically start with some sort of prior, belief, or opinion on which words relate to sentiment, be that positivity, negativity, uncertainty, or indeed any other type of sentiment.
The key idea is that we start with an opinion on what we believe are words that plausibly relate to sentiment. And we call that collection of words a sentiment language.
And by the way, we tend to denote the sentiment language as ψ (the Greek letter “psi”).
Importantly, a sentiment language is the same as a sentiment lexicon, which is the same as a sentiment dictionary.
People who apply sentiment analysis tend to use these terms interchangeably.
Consistent with other aspects of finance, we just use a bunch of words which mean exactly the same thing. Because it tends to make us sound clever.
But jokes aside, we can either create the language from scratch, or we can work with existing language.
If we create the sentiment language from scratch, then we need to start with our own priors or our own beliefs.
Priors on which words we think plausibly relate to positivity, negativity, or any type of sentiment that we’re looking at.
For instance, if we wanted to create a custom dictionary or lexicon for positivity / positive sentiment…
Then we might argue that words like “happy”, “positive”, “growth”, “increase”, or “excitement” are all words which plausibly relate to positivity.
We would then say that those words together make up our positive sentiment language.
Similarly, if we wanted to create a lexicon for negativity, then we might argue that words like “sad”, “disappoint”, “decline”, “decrease” are words which plausibly reflect negative sentiment.
Or agree with others’ priors
Alternatively, we could work with existing dictionaries or lexicons.
And in that case, we’re essentially agreeing with the priors, or beliefs, or opinions of other people.
The people who’ve created these dictionaries of a bunch of words, which they think plausibly relate to a specific kind of sentiment.
Regardless of whether you use a custom sentiment language or an existing sentiment language…
The idea is that once you have this sentiment language…
You can estimate sentiment as a function of the words in a given document which belong to a sentiment language.
Strictly, you’d want to look at the cleaned words which belong to a sentiment language. For instance, which of the cleaned words belong to positive sentiment, and which of those cleaned words belong to negative sentiment, etc.
Now, you don’t really need to worry about what we mean by cleaned words just yet. We have got a separate post on how to clean text data.
The only thing you need to know for now is that we don’t really work with the words in a given document.
Instead, we work with the cleaned words in a given document.
So that’s the words after we’ve performed text cleaning.
In a nutshell, we’re literally just looking at the cleaned words inside a document, and identifying which of those cleaned words relate to positive language. And which of those relate to negative language. Or which of those relate to the specific sentiment language that we’re trying to explore.
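To make the idea concrete, here is a minimal sketch in Python. It assumes two tiny lexicons built from the example words earlier in this post, a very rough stand-in for text cleaning, and one possible choice of "function": the share of cleaned words that belong to a sentiment language. The names `clean_words` and `sentiment_share`, and the example sentence, are purely illustrative assumptions. Other functions, such as raw counts or weighted counts, are equally plausible.

```python
import re

# Hypothetical mini-lexicons, taken from the example words in this post.
POSITIVE = {"happy", "positive", "growth", "increase", "excitement"}
NEGATIVE = {"sad", "disappoint", "decline", "decrease"}


def clean_words(text):
    """Rough stand-in for text cleaning: lowercase, keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_share(words, lexicon):
    """Share of cleaned words that belong to the given sentiment language."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in lexicon)
    return hits / len(words)


# Made-up sentence, used only to illustrate the mechanics.
document = (
    "Revenue growth continued, but we expect a decline in margins "
    "and a decrease in demand."
)
words = clean_words(document)

positive_score = sentiment_share(words, POSITIVE)
negative_score = sentiment_share(words, NEGATIVE)

# One common way to combine the two: positive share minus negative share.
net_score = positive_score - negative_score
```

Here, one of the cleaned words falls in the positive language and two fall in the negative language, so the net score comes out negative.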
All right. So that’s as far as the lexicon based approach to estimating sentiment for sentiment analysis in finance goes.
Machine learning based approach for sentiment analysis in finance
Perhaps fancier than the standard lexicon based approach, the machine learning approach is arguably more objective. However, data limitations mean we can’t quite leverage its power in finance (yet). But here’s how you’d go about applying it.
Start with a subsample
In the machine learning approach, we typically start with a subsample of the corpus which displays sentiment.
Remember, a corpus is just the entire sample of text data that you’re working with.
So essentially, you would start with a subsample of that corpus for which you already have estimates of sentiment.
The classic example, outside of finance, is the IMDB movie reviews dataset, where you’ve got a bunch of movie reviews labeled as either “positive” or “negative”.
In that case, the labeled reviews are your subsample with known sentiment.
Train, Test, Iterate
You would then apply some sort of classification or machine learning algorithm to classify or estimate sentiment for other sample texts.
In other words, you would train the algorithm on a subsample which has sentiment labels. And then you would apply the algorithm to other data to classify each text as either positive or negative.
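As a rough illustration of the train-then-apply idea, the sketch below implements a small Naive Bayes classifier using only the Python standard library. Naive Bayes is just one possible choice of algorithm, picked here for brevity. The four labeled "reviews" are made up for illustration and are not taken from any real dataset, and the function names are assumptions.

```python
import math
import re
from collections import Counter

# Tiny, made-up labeled subsample (illustrative only, not real data).
labeled = [
    ("a wonderful and moving film", "positive"),
    ("great acting and a wonderful story", "positive"),
    ("a dull and boring film", "negative"),
    ("boring story and terrible acting", "negative"),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label and documents per label."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Train on the labeled subsample, then apply to unlabeled text.
model = train(labeled)
prediction_1 = classify("wonderful acting", model)
prediction_2 = classify("a terrible and boring film", model)
```

On this toy data, the first new snippet is classified as positive and the second as negative. A real application would need far more labeled text, plus a held-out test set to check how well the classifier generalises.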
Keep in mind that the machine learning approach isn’t quite as simple and straightforward as we’ve made it seem.
It can in fact get far more complicated than what we’ve written here.
Issues with applying it in Finance
But the reason we’re not really going into much more detail is because, fundamentally, applying this approach for sentiment analysis in Finance is quite problematic.
Lack of classified data
And that is because we don’t really have financial text data with sentiment labels!
There are plenty of IMDB movie reviews, for instance, that have labels of either positive or negative sentiment, or some sort of rating on a scale between 1 and 10.
And you could easily say, or plausibly argue, that any movie with a rating greater than five is plausibly positive. Any movie with a rating below five is plausibly negative. And any movie with a rating of 5 is plausibly neutral.
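That mapping from ratings to labels can be sketched in a few lines of Python. The function name `rating_to_label` and the example ratings are illustrative assumptions. The midpoint of 5 follows the rule described above.

```python
def rating_to_label(rating, midpoint=5):
    """Map a 1-10 rating to a coarse sentiment label around a midpoint."""
    if rating > midpoint:
        return "positive"
    if rating < midpoint:
        return "negative"
    return "neutral"


# Made-up ratings, used only to show the mapping.
ratings = [9, 5, 2]
labels = [rating_to_label(rating) for rating in ratings]
```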
But we don’t quite have that sort of luxury in finance.
There’s no database – or no existing database – at least at the time of writing, which has already classified companies based on their level of sentiment.
We don’t have a database of each company on the S&P500 for example, classified as either displaying positive tone, or negative tone, or net positive tone, or any other sentiment for that matter.
Finding Identifiers to Merge On
Now, you might argue that it should be fairly trivial to create such a database. For instance, why not just scrape the Twitter accounts of companies, and classify their tweets by sentiment?
As trivial as that actually is, the issue then becomes one of connecting that data to firm performance. Or to any of the company’s financial fundamentals for that matter.
Because in finance, conducting sentiment analysis usually involves linking sentiment to firms.
The end result could be an investment strategy that exploits the sentiment of firms over the long term; or indeed, one that uses some sort of sentiment signal to determine the buy / sell / hold decision for a given stock.
But doing so will be challenging because the social media profiles of companies don’t tend to include their tickers, CIK number, CUSIP, or other firm level identifiers!
That in turn makes merging with other financial data tricky.
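The merging problem can be illustrated with a small Python sketch. Everything in it is hypothetical: the handles, tickers, sentiment values, and fundamentals are made up, and the link table `handle_to_ticker` stands in for whatever manual mapping you would have to build yourself.

```python
# Hypothetical sentiment estimates keyed by social media handle (made-up values).
sentiment_by_handle = {"@examplecorp": 0.4, "@samplebank": -0.1}

# Hypothetical fundamentals keyed by ticker (made-up values).
fundamentals_by_ticker = {"EXC": {"roa": 0.08}, "SMPB": {"roa": 0.02}}

# A naive merge on the raw keys finds nothing, since handles are not tickers.
naive_matches = set(sentiment_by_handle) & set(fundamentals_by_ticker)

# A hand-built link table (the tedious, error-prone part) makes a merge possible.
handle_to_ticker = {"@examplecorp": "EXC"}

merged = {}
for handle, sentiment in sentiment_by_handle.items():
    ticker = handle_to_ticker.get(handle)
    if ticker in fundamentals_by_ticker:
        merged[ticker] = {"sentiment": sentiment, **fundamentals_by_ticker[ticker]}

# Handles with no known identifier cannot be linked to any firm-level data.
unmatched = [handle for handle in sentiment_by_handle if handle not in handle_to_ticker]
```

Only the firm with an entry in the link table makes it into the merged data. The other handle is left unmatched, which is exactly the problem at scale.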
And so at least today, machine learning approaches in finance are largely focused on traditional asset pricing.
For instance, machine learning algorithms are used to try and explain the cross-section of returns. Or even the volatility of returns.
But as far as estimating sentiment for sentiment analysis in finance goes…
The literature – both academic and practitioner – largely relies on the lexicon or dictionary based approach.
In summary, we learned that sentiment can broadly be estimated using a lexicon / dictionary based approach, or a machine learning approach.
Applying the machine learning approach in finance can be quite problematic. Because, at least today, there’s no existing database of companies labeled based on their level of sentiment.
And so for the most part, both academics and practitioners largely rely on using the lexicon or dictionary based approach for estimating sentiment.
Perhaps most importantly, we learned that for the dictionary / lexicon based approach, you can estimate sentiment as a function of the cleaned words which belong to a sentiment language.