Sentiment Analysis using LightGBM — Alternative approach to RNN and LSTM

Tapas Das
5 min read · Sep 8, 2020
Pic Courtesy: https://www.netbase.com/blog/what-is-social-sentiment-analysis/

Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using machine learning techniques.

Understanding people’s emotions is essential for businesses since customers are able to express their thoughts and feelings more openly than ever before. By automating the process of analysing customer feedback, from survey responses to social media conversations, brands are able to listen attentively to their customers, and tailor products and services to meet their needs.

In machine learning, it’s traditionally considered best practice to use Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks for text analysis and sentiment classification, since they provide state-of-the-art results.

However, in this post, I’d like to discuss an alternative approach using LightGBM, which works just as well as RNNs or LSTMs, provided we do intelligent feature engineering and feature selection.

Let’s get started.

1) Dataset Description

In this post, I’ll be using the “Product Sentiment Classification” dataset hosted by MachineHack. The image below shows the different features in the dataset.

Also, the image below shows the top five records from the training dataset.

As we can see from the image, the “Product_Description” field contains the product reviews provided by the users. This field will be our major focus when implementing the sentiment analysis model.

2) Feature Engineering (prior to text cleansing)

Prior to cleaning up the text in “Product_Description”, we will extract some basic features (mentioned below) from the user reviews, which will be helpful in model building.

  1. Number of words
  2. Number of unique words
  3. Number of characters
  4. Number of stop-words
  5. Number of punctuation marks
  6. Number of emojis
  7. Number of hashtags
  8. Number of @ mentions
  9. Number of numerics in the text
  10. Number of title case words
  11. Average length of the words

The image below shows the Python code for extracting the above-mentioned features.
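For reference, here is a minimal sketch of how these features can be extracted with pandas. The `emoji` package and NLTK’s stop-word list are assumptions on my part, as are the column and function names.

```python
# A minimal sketch, assuming the training data sits in a pandas DataFrame
# named `train` with the review text in "Product_Description".
# Requires: pip install pandas nltk emoji, plus nltk.download("stopwords").
import string

import emoji
import pandas as pd
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def add_basic_features(df, col="Product_Description"):
    text = df[col].astype(str)
    words = text.str.split()

    df["num_words"] = words.str.len()
    df["num_unique_words"] = words.apply(lambda w: len(set(w)))
    df["num_chars"] = text.str.len()
    df["num_stopwords"] = words.apply(lambda w: sum(t.lower() in STOP_WORDS for t in w))
    df["num_punctuation"] = text.apply(lambda t: sum(c in string.punctuation for c in t))
    df["num_emojis"] = text.apply(emoji.emoji_count)  # emoji >= 1.6 exposes emoji_count
    df["num_hashtags"] = words.apply(lambda w: sum(t.startswith("#") for t in w))
    df["num_mentions"] = words.apply(lambda w: sum(t.startswith("@") for t in w))
    df["num_numerics"] = words.apply(lambda w: sum(t.isdigit() for t in w))
    df["num_title_words"] = words.apply(lambda w: sum(t.istitle() for t in w))
    df["avg_word_length"] = words.apply(lambda w: sum(len(t) for t in w) / max(len(w), 1))
    return df

train = add_basic_features(train)
```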

3) Text Cleansing

Next, we will apply some basic text cleansing techniques (listed below) to get the text data ready for further feature engineering.

  • Lowercase all text
  • Remove new-line (\n) characters
  • Remove hyperlinks
  • Remove “@” symbols
  • Remove emojis
  • Remove punctuation, numbers and special characters
  • Remove short words (fewer than 3 characters)

I have created a consolidated NLP text cleansing function to clean the text data in one go, as shown in the image below.
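For reference, a minimal sketch of such a consolidated function; the exact regex patterns here are illustrative, not necessarily the original implementation.

```python
# A minimal consolidated cleansing function; the patterns below assume
# English text and apply the steps listed above in order.
import re

def clean_text(text):
    text = text.lower()                                      # lowercase all text
    text = text.replace("\\n", " ").replace("\n", " ")       # remove new-line characters
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # remove hyperlinks
    text = re.sub(r"@\w+", " ", text)                        # remove @ mentions
    text = text.encode("ascii", "ignore").decode()           # remove emojis (and other non-ASCII)
    text = re.sub(r"[^a-z\s]", " ", text)                    # remove punctuation, numbers, specials
    return " ".join(w for w in text.split() if len(w) >= 3)  # remove short words

train["clean_description"] = train["Product_Description"].astype(str).apply(clean_text)
```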

After the text data has been cleaned, we will use the “spaCy” Python library to lemmatize the data.

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming, but it brings context to the words, linking different inflected forms with the same meaning to a single base word.

Examples of lemmatization:

-> rocks : rock
-> corpora : corpus
-> better : good

The image below shows the process of lemmatizing the text data.
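A minimal sketch with spaCy, assuming the small English model is installed (python -m spacy download en_core_web_sm) and that the cleansed text lives in the `clean_description` column created above:

```python
import spacy

# Disable the parser and NER components we don't need; only the
# tagger/lemmatizer pipeline is required for lemmatization.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text):
    return " ".join(token.lemma_ for token in nlp(text))

train["clean_description"] = train["clean_description"].apply(lemmatize)
```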

The image below compares the raw and cleansed text data in the “Product_Description” field.

4) Feature Engineering (post text cleansing)

We will be using the following four Python libraries to perform further feature engineering on the cleansed text data.

  • TextBlob — to extract sentiment features (polarity and subjectivity)
  • TensorFlow Universal Sentence Encoder
  • SkLearn TF-IDF Vectorizer
  • SkLearn Count Vectorizer

TextBlob:

The image below shows the Python code to extract polarity and subjectivity scores from the text data using the TextBlob library.
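Note that TextBlob’s sentiment API exposes a polarity score (-1 to 1) and a subjectivity score (0 to 1), which I take to be the two features meant here. A minimal sketch:

```python
from textblob import TextBlob

# Polarity runs from negative (-1) to positive (+1);
# subjectivity runs from objective (0) to subjective (1).
train["polarity"] = train["clean_description"].apply(lambda t: TextBlob(t).sentiment.polarity)
train["subjectivity"] = train["clean_description"].apply(lambda t: TextBlob(t).sentiment.subjectivity)
```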

TensorFlow Universal Sentence Encoder:

The image below shows the Python code for using TensorFlow’s Universal Sentence Encoder for text encoding. It generates 512 features from any piece of text.
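A minimal sketch using the publicly released v4 module from TensorFlow Hub; the variable names are illustrative:

```python
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TensorFlow Hub (downloads on first use).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Encode every review into a 512-dimensional vector.
use_features = np.array(embed(train["clean_description"].tolist()))
print(use_features.shape)  # (num_rows, 512)
```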

You can refer to the link below to acquire a better understanding of the Universal Sentence Encoder.

SkLearn TF-IDF and Count Vectorizer:

The image below shows the Python code to get word counts and TF-IDF (term frequency, inverse document frequency) features from the text data.
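A minimal sketch of both vectorizers; the n-gram range and feature cap here are illustrative assumptions, not tuned values:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Raw term counts (bag of words over unigrams and bigrams).
count_vec = CountVectorizer(ngram_range=(1, 2), max_features=1000)
count_features = count_vec.fit_transform(train["clean_description"])

# Term frequency weighted by inverse document frequency.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
tfidf_features = tfidf_vec.fit_transform(train["clean_description"])
```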

You can refer to the link below to gain a better understanding of these techniques.

5) Building the LightGBM model

After we are done with text data cleansing and feature engineering, the last step is to build the LightGBM model for sentiment classification.

The image below shows the hyper-parameters used for building the model.
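For reference, here is an illustrative multi-class parameter set; these are starting-point values I am assuming, not the tuned hyper-parameters from the original experiment:

```python
lgb_params = {
    "objective": "multiclass",
    "num_class": 4,              # assumption: 4 sentiment classes in this dataset
    "metric": "multi_logloss",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "seed": 42,
}
```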

Once the hyper-parameters are set, we can perform cross-validation and model prediction, as shown in the image below.
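A minimal sketch of stratified K-fold training with averaged test predictions, assuming the feature matrices X and X_test (NumPy arrays or SciPy sparse matrices) and the NumPy label vector y have already been assembled from the steps above:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_preds = np.zeros((X_test.shape[0], lgb_params["num_class"]))

for fold, (trn_idx, val_idx) in enumerate(folds.split(X, y)):
    trn_data = lgb.Dataset(X[trn_idx], label=y[trn_idx])
    val_data = lgb.Dataset(X[val_idx], label=y[val_idx])
    model = lgb.train(
        lgb_params,
        trn_data,
        num_boost_round=2000,
        valid_sets=[val_data],
        callbacks=[lgb.early_stopping(stopping_rounds=100)],  # LightGBM >= 3.3 callback API
    )
    # Average each fold's test-set class probabilities.
    test_preds += model.predict(X_test, num_iteration=model.best_iteration) / folds.n_splits

final_labels = test_preds.argmax(axis=1)  # predicted sentiment class per test row
```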

Concluding Remarks

This concludes the sentiment analysis and classification pipeline using the LightGBM model. The steps described in this post helped me achieve 5th rank on the MachineHack leaderboard for the “Product Sentiment Classification” hackathon.

You can find the codebase for this post at the link below.

Do leave your comments, feedback, and challenges (if you’re facing any), and I’ll touch base with you individually so we can collaborate.

Also, please visit my blog (link below) to explore more on Machine Learning and Linux Computing.
