Sentiment Analysis using LightGBM — Alternative approach to RNN and LSTM

Tapas Das
5 min read · Sep 8, 2020
Pic Courtesy: https://www.netbase.com/blog/what-is-social-sentiment-analysis/

Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using machine learning techniques.

Understanding people’s emotions is essential for businesses since customers are able to express their thoughts and feelings more openly than ever before. By automating the process of analysing customer feedback, from survey responses to social media conversations, brands are able to listen attentively to their customers, and tailor products and services to meet their needs.

In machine learning, it’s traditionally considered best practice to use Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks for text analysis and sentiment classification, since they provide state-of-the-art results.

However, in this post, I’d like to discuss an alternative approach using LightGBM, which works just as well as RNNs or LSTMs, provided we do intelligent feature engineering and feature selection.

Let’s get started.

1) Dataset Description

In this post, I’ll be using the “Product Sentiment Classification” dataset hosted by MachineHack. The image below shows the different features in the dataset.

Also, the image below shows the top five records from the training dataset.

As we can see from the image, the “Product_Description” field contains the product reviews provided by the users. This field will be our major focus when implementing the sentiment analysis model.

2) Feature Engineering (prior to text cleansing)

Prior to cleaning up the text in “Product_Description”, we will extract some basic features (mentioned below) from the user reviews, which will be helpful in model building.

  1. Number of words
  2. Number of unique words
  3. Number of characters
  4. Number of stop-words
  5. Number of punctuation marks
  6. Number of emojis
  7. Number of hashtags
  8. Number of @ mentions
  9. Number of numerics in the text
  10. Number of title case words
  11. Average length of the words

The image below shows the Python code for extracting the above-mentioned features.
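For reference, here is a minimal sketch of how these features can be extracted with pandas. The `emoji` package and NLTK’s stop-word list are assumptions on my part, as are the column and function names.

```python
# A minimal sketch, assuming the training data sits in a pandas DataFrame
# named `train` with the review text in "Product_Description".
# Requires: pip install pandas nltk emoji, plus nltk.download("stopwords").
import string

import emoji
import pandas as pd
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def add_basic_features(df, col="Product_Description"):
    text = df[col].astype(str)
    words = text.str.split()

    df["num_words"] = words.str.len()
    df["num_unique_words"] = words.apply(lambda w: len(set(w)))
    df["num_chars"] = text.str.len()
    df["num_stopwords"] = words.apply(lambda w: sum(t.lower() in STOP_WORDS for t in w))
    df["num_punctuation"] = text.apply(lambda t: sum(c in string.punctuation for c in t))
    df["num_emojis"] = text.apply(emoji.emoji_count)  # emoji >= 1.6 exposes emoji_count
    df["num_hashtags"] = words.apply(lambda w: sum(t.startswith("#") for t in w))
    df["num_mentions"] = words.apply(lambda w: sum(t.startswith("@") for t in w))
    df["num_numerics"] = words.apply(lambda w: sum(t.isdigit() for t in w))
    df["num_title_words"] = words.apply(lambda w: sum(t.istitle() for t in w))
    df["avg_word_length"] = words.apply(lambda w: sum(len(t) for t in w) / max(len(w), 1))
    return df

train = add_basic_features(train)
```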

3) Text Cleansing

Next, we will apply some basic text cleansing techniques (listed below) to get the text data ready for further feature engineering.

  • Lowercase all text
  • Remove new-line (\n) characters
  • Remove hyperlinks
  • Remove “@” symbols
  • Remove emojis
  • Remove punctuation, numbers and special characters
  • Remove short words (fewer than 3 characters)

I have created a consolidated NLP text cleansing function to clean the text data in one go, as shown in the image below.
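For reference, a minimal sketch of such a consolidated function; the exact regex patterns here are illustrative, not necessarily the original implementation.

```python
# A minimal consolidated cleansing function; the patterns below assume
# English text and apply the steps listed above in order.
import re

def clean_text(text):
    text = text.lower()                                      # lowercase all text
    text = text.replace("\\n", " ").replace("\n", " ")       # remove new-line characters
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # remove hyperlinks
    text = re.sub(r"@\w+", " ", text)                        # remove @ mentions
    text = text.encode("ascii", "ignore").decode()           # remove emojis (and other non-ASCII)
    text = re.sub(r"[^a-z\s]", " ", text)                    # remove punctuation, numbers, specials
    return " ".join(w for w in text.split() if len(w) >= 3)  # remove short words

train["clean_description"] = train["Product_Description"].astype(str).apply(clean_text)
```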

After the text data has been cleaned, we will use the “spaCy” Python library to lemmatize the data.

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming, but it brings context to the words, linking different inflected forms with the same meaning to a single base word.

Examples of lemmatization:

-> rocks : rock
-> corpora : corpus
-> better : good

The image below shows the process of lemmatizing the text data.
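A minimal sketch with spaCy, assuming the small English model is installed (python -m spacy download en_core_web_sm) and that the cleansed text lives in the `clean_description` column created above:

```python
import spacy

# Disable the parser and NER components we don't need; only the
# tagger/lemmatizer pipeline is required for lemmatization.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text):
    return " ".join(token.lemma_ for token in nlp(text))

train["clean_description"] = train["clean_description"].apply(lemmatize)
```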

The image below compares the raw and cleansed text data in the “Product_Description” field.

4) Feature Engineering (post text cleansing)

We will be using the following four Python libraries to perform further feature engineering on the cleansed text data.

  • TextBlob — to extract sentiment features (polarity and subjectivity)
  • TensorFlow Universal Sentence Encoder
  • SkLearn TF-IDF Vectorizer
  • SkLearn Count Vectorizer

TextBlob:

The image below shows the Python code to extract polarity and subjectivity scores from the text data using the TextBlob library.
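Note that TextBlob’s sentiment API exposes a polarity score (-1 to 1) and a subjectivity score (0 to 1), which I take to be the two features meant here. A minimal sketch:

```python
from textblob import TextBlob

# Polarity runs from negative (-1) to positive (+1);
# subjectivity runs from objective (0) to subjective (1).
train["polarity"] = train["clean_description"].apply(lambda t: TextBlob(t).sentiment.polarity)
train["subjectivity"] = train["clean_description"].apply(lambda t: TextBlob(t).sentiment.subjectivity)
```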

TensorFlow Universal Sentence Encoder:

The image below shows the Python code for using TensorFlow’s Universal Sentence Encoder for text encoding. It generates 512 features from any piece of text.
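A minimal sketch using the publicly released v4 module from TensorFlow Hub; the variable names are illustrative:

```python
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TensorFlow Hub (downloads on first use).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Encode every review into a 512-dimensional vector.
use_features = np.array(embed(train["clean_description"].tolist()))
print(use_features.shape)  # (num_rows, 512)
```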

You can refer to the link below to acquire a better understanding of the Universal Sentence Encoder.

SkLearn TF-IDF and Count Vectorizer:

The image below shows the Python code to get word counts and TF-IDF (term frequency, inverse document frequency) features from the text data.
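A minimal sketch of both vectorizers; the n-gram range and feature cap here are illustrative assumptions, not tuned values:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Raw term counts (bag of words over unigrams and bigrams).
count_vec = CountVectorizer(ngram_range=(1, 2), max_features=1000)
count_features = count_vec.fit_transform(train["clean_description"])

# Term frequency weighted by inverse document frequency.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
tfidf_features = tfidf_vec.fit_transform(train["clean_description"])
```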

You can refer to the link below to gain a better understanding of these techniques.

5) Building the LightGBM model

After we are done with text data cleansing and feature engineering, the last step is to build the LightGBM model for sentiment classification.

The image below shows the hyper-parameters used for building the model.
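For reference, here is an illustrative multi-class parameter set; these are starting-point values I am assuming, not the tuned hyper-parameters from the original experiment:

```python
lgb_params = {
    "objective": "multiclass",
    "num_class": 4,              # assumption: 4 sentiment classes in this dataset
    "metric": "multi_logloss",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "seed": 42,
}
```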

Once the hyper-parameters are set, we can perform cross-validation and model prediction, as shown in the image below.
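A minimal sketch of stratified K-fold training with averaged test predictions, assuming the feature matrices X and X_test (NumPy arrays or SciPy sparse matrices) and the NumPy label vector y have already been assembled from the steps above:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_preds = np.zeros((X_test.shape[0], lgb_params["num_class"]))

for fold, (trn_idx, val_idx) in enumerate(folds.split(X, y)):
    trn_data = lgb.Dataset(X[trn_idx], label=y[trn_idx])
    val_data = lgb.Dataset(X[val_idx], label=y[val_idx])
    model = lgb.train(
        lgb_params,
        trn_data,
        num_boost_round=2000,
        valid_sets=[val_data],
        callbacks=[lgb.early_stopping(stopping_rounds=100)],  # LightGBM >= 3.3 callback API
    )
    # Average each fold's test-set class probabilities.
    test_preds += model.predict(X_test, num_iteration=model.best_iteration) / folds.n_splits

final_labels = test_preds.argmax(axis=1)  # predicted sentiment class per test row
```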

Concluding Remarks

This concludes the sentiment analysis and classification pipeline using the LightGBM model. The steps described in this post helped me achieve 5th rank on the MachineHack leaderboard for the “Product Sentiment Classification” hackathon.

You can find the codebase for this post at the link below.

Do leave your comments, feedback, and challenges (if you’re facing any), and I’ll touch base with you individually so we can collaborate.

Also, please visit my blog (link below) to explore more on Machine Learning and Linux Computing.
