
NLP Sentiment Analysis for Beginners

A Step-By-Step Approach to Understand TextBlob, NLTK, Scikit-Learn, and LSTM networks


Jun 14 · 12 min read


Photo by Romain Vignes on Unsplash

Introduction

Natural Language Processing (NLP) is the area of machine learning that focuses on the generation and understanding of language. Its main objective is to enable machines to understand, communicate and interact with humans in a natural way.

NLP has many tasks such as Text Generation, Text Classification, Machine Translation, Speech Recognition, Sentiment Analysis, etc. For a beginner to NLP, looking at these tasks and all the techniques involved in handling them can be quite daunting, and it is genuinely difficult for a newcomer to know where and how to start.

Out of all the NLP tasks, I personally think that Sentiment Analysis (SA) is probably the easiest, which makes it the most suitable starting point for anyone who wants to get into NLP.

In this article, I will show you how to perform SA using various techniques, ranging from simple ones like TextBlob and NLTK to more advanced ones like Sklearn and Long Short Term Memory (LSTM) networks.

After reading this, you can expect to understand the following:

  1. Toolkits used in SA: TextBlob and NLTK
  2. Algorithms used in SA: Naive Bayes, SVM, Logistic Regression and LSTM
  3. Jargon such as stop-word removal, stemming, bag of words, corpus, tokenization, etc.
  4. How to create a word cloud

The flow of this article:

  1. Data cleaning and pre-processing
  2. TextBlob
  3. Algorithms: Logistic Regression, Naive Bayes, SVM and LSTM

Let’s get started!


Just a pic of my messy work-from-home corner

Data and Problem Formulation

In this article, I will use the sentiment data set that consists of 3000 sentences coming from reviews on imdb.com, amazon.com, and yelp.com. Each sentence is labelled according to whether it comes from a positive review (labelled as 1) or a negative review (labelled as 0).

The data can be downloaded from the original website or, alternatively, from here (recommended). The folder sentiment_labelled_sentences (containing the data file full_set.txt) should be in the same directory as your notebook.

Load and Pre-process the Data

Set up and import libraries

%matplotlib inline
import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

Now, we load in the data and look at the first 10 comments.

with open("sentiment_labelled_sentences/full_set.txt") as f:
    content = f.readlines()

content[0:10]


## Remove leading and trailing white space
content = [x.strip() for x in content]

## Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

sentences[0:10]
labels[0:10]


Separate sentences and labels

One can stop here for this section. However, I prefer transforming y into a (-1, 1) form, where -1 represents negative and 1 represents positive.

## Transform the labels from '0 v.s. 1' to '-1 v.s. 1'
y = np.array(labels, dtype='int8')
y = 2*y - 1

NOTICE THAT SO FAR WE HAVE NOT DONE ANYTHING TO THE WORDS YET! The next section focuses on the words in the sentences.

Pre-processing the text data

To feed the data into any model, the input must be in vector form. We will do the following transformations:

  • Remove punctuation and numbers
  • Transform all words to lower-case
  • Remove stop words (e.g. the, a, that, this, it, …)
  • Tokenize the texts
  • Convert the sentences into vectors, using a bag-of-words representation

I will explain some new jargon here.

  1. Stop words: common words that are 'not interesting' for the task at hand. These usually include articles such as 'a' and 'the', pronouns such as 'i' and 'they', and prepositions such as 'to' and 'from'.

def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

stoppers = ['a', 'is', 'of', 'the', 'this', 'uhm', 'uh']

removeStopWords(stoppers, "this is a test of the stop word removal code")

Alternatively, we can use NLTK if we want the complete set of common English stop words:

from nltk.corpus import stopwords

stops = stopwords.words("english")
removeStopWords(stops, "this is a test of the stop word removal code.")

Same result

2. Corpus: simply a collection of texts. The order of words matters: 'not great' is different from 'great'.

3. Document-Term Matrix, or Bag of Words (BOW): simply a vector representation of text sentences (or documents).


A common way to represent a set of features like this is a one-hot vector. For example, let's say the vocabulary from our set of texts is:

today, here, I, a, fine, sun, moon, bird, saw

The sentence we want to build a BOW for is:

I saw a bird today.

Using a 1 or 0 for each word in the vocabulary, our BOW encoding of this sentence would be:

1 0 1 1 0 0 0 1 1

In order to create a bag of words, we need to break down a long sentence or a document into smaller pieces. This process is called Tokenization . The most common tokenization technique is to break down text into words. We can do this using CountVectorizer in Scikit-Learn, where every row will represent a different document and every column will represent a different word. In addition, with CountVectorizer , we can also remove stop words.
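
As a quick, minimal sketch of what this produces (the names toy_corpus and toy_vectorizer are just for this illustration, and token_pattern is set so that one-letter words like 'I' and 'a' are kept, which the default pattern would drop):

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: the vectorizer learns its vocabulary from these sentences
toy_corpus = ["I saw a bird today", "the sun is fine here"]

# Keep one-letter tokens such as 'I' and 'a'
toy_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
toy_bow = toy_vectorizer.fit_transform(toy_corpus)

# Vocabulary in column order, then the document-term matrix:
# one row per sentence, one column per word
print(sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get))
print(toy_bow.toarray())

Each row is the bag-of-words vector of one sentence, much like the hand-built one-hot example above, except that CountVectorizer stores counts and orders its columns alphabetically.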

def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
remove_digits = [full_remove(x, digits) for x in sentences]

## Remove punctuation
remove_punc = [full_remove(x, list(string.punctuation)) for x in remove_digits]

## Make everything lower-case and remove any white space
sents_lower = [x.lower() for x in remove_punc]
sents_lower = [x.strip() for x in sents_lower]

## Remove stop words
from nltk.corpus import stopwords
stops = stopwords.words("english")

def removeStopWords(stopWords, txt):
    newtxt = ' '.join([word for word in txt.split() if word not in stopWords])
    return newtxt

sents_processed = [removeStopWords(stops, x) for x in sents_lower]

Let's look at how our sentences look now.


Uhm, wait a minute! Removing so many stop words makes many sentences lose their meaning. For example, 'way plug us unless go converter' does not make any sense to me. This is because we removed all the common English stop words using NLTK. To overcome this, let's create our own, smaller set of stop words instead.

stop_set = ['the', 'a', 'an', 'i', 'he', 'she', 'they', 'to', 'of', 'it', 'from']

sents_processed = [removeStopWords(stop_set, x) for x in sents_lower]


It is OK to stop here and move on to tokenization. However, one can continue with stemming. The goal of stemming is to strip off prefixes and suffixes and convert a word into its base form, e.g. studying -> study, beautiful -> beauty, cared -> care. In NLTK, there are two popular stemming techniques, called Porter and Lancaster.

import nltk

def stem_with_porter(words):
    porter = nltk.PorterStemmer()
    new_words = [porter.stem(w) for w in words]
    return new_words

def stem_with_lancaster(words):
    lancaster = nltk.LancasterStemmer()
    new_words = [lancaster.stem(w) for w in words]
    return new_words

sentence = "Please don't unbuckle your seat-belt while I am driving, he said"

print("porter:", stem_with_porter(sentence.split()))
print()
print("lancaster:", stem_with_lancaster(sentence.split()))

Let's try it on our sents_processed to see whether the results make sense.

porter = [stem_with_porter(x.split()) for x in sents_processed]
porter = [" ".join(i) for i in porter]

porter[0:10]

Some weird changes occur, e.g. very -> veri, quality -> qualiti, value -> valu.

I don't know what you think, but I personally do not like stemming here. Maybe it is useful in other cases. For those who are experts in stemming, let me know when it is useful :)

4. Term Frequency-Inverse Document Frequency (TF-IDF): a measure of the relative importance of a word within a document, in the context of multiple documents. In our case, multiple reviews.

We start with the TF part, which is simply the normalized frequency of the word in the document:

(word count in document) / (total words in document)

The IDF is a weighting of the uniqueness of the word across all of the documents. The complete formula of TF-IDF is:

tf_idf(t,d) = (wc(t,d) / wc(d)) / (dc(t) / dc())

where:

wc(t,d) = # of occurrences of term t in doc d

wc(d) = # of words in doc d

dc(t) = # of docs that contain at least 1 occurrence of term t

dc() = # of docs in collection
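
To make the formula concrete, here is a small worked example with made-up numbers (purely hypothetical values; note that scikit-learn's TfidfTransformer, used below, applies a smoothed logarithmic IDF and normalizes each row, so its exact values will differ):

# Hypothetical counts, just to illustrate the formula above
wc_t_d = 2       # the term appears twice in this review
wc_d = 10        # the review contains 10 words
dc_t = 5         # the term appears in 5 reviews
dc_all = 3000    # there are 3000 reviews in total

tf = wc_t_d / wc_d    # 0.2
df = dc_t / dc_all    # ~0.00167
print(tf / df)        # 120.0 (rare terms get boosted, common terms get discounted)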

Now, let’s create a bag of words and normalise the texts

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

vectorizer = CountVectorizer(analyzer="word",
                             preprocessor=None,
                             stop_words='english',
                             max_features=6000,
                             ngram_range=(1, 5))

data_features = vectorizer.fit_transform(sents_processed)

tfidf_transformer = TfidfTransformer()
data_features_tfidf = tfidf_transformer.fit_transform(data_features)
data_mat = data_features_tfidf.toarray()

Now data_mat is our document-term matrix, and the input is ready to be fed into a model. Let's create training and test sets. Here, I split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).

np.random.seed(0)

test_index = np.append(np.random.choice((np.where(y == -1))[0], 250, replace=False),
                       np.random.choice((np.where(y == 1))[0], 250, replace=False))
train_index = list(set(range(len(labels))) - set(test_index))

train_data = data_mat[train_index,]
train_labels = y[train_index]

test_data = data_mat[test_index,]
test_labels = y[test_index]

TextBlob

  1. TextBlob: Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it is in a sentence. The TextBlob module allows us to take advantage of these labels: it finds all the words and phrases that it can assign polarity and subjectivity to, and averages them together.
  2. Sentiment Labels: Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
  • Polarity: How positive or negative a word is. -1 is very negative. +1 is very positive.
  • Subjectivity: How subjective, or opinionated, a word is. 0 is a fact. +1 is very much an opinion.
from textblob import TextBlob

# Create polarity and subjectivity functions
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

pol_list = [pol(x) for x in sents_processed]
sub_list = [sub(x) for x in sents_processed]

This is a rule-based method that determines the sentiment (polarity and subjectivity) of a review.
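
As a rough sanity check (not part of the original workflow; thresholding the polarity at 0 is an arbitrary assumption), we can turn the polarity scores into labels and compare them with y:

import numpy as np

# Call polarity >= 0 "positive" (+1) and polarity < 0 "negative" (-1),
# then measure how often this rule-based label matches the true label in y
pol_preds = np.where(np.array(pol_list) >= 0, 1, -1)
print("TextBlob agreement with labels: ", np.mean(pol_preds == y))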

The next section will look at various algorithms.

Logistic Regression

from sklearn.linear_model import SGDClassifier

## Fit logistic classifier on training data
clf = SGDClassifier(loss="log", penalty="none")
clf.fit(train_data, train_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_

## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)

## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (test_labels > 0.0))

print("Training error: ", float(errs_train)/len(train_labels))
print("Test error: ", float(errs_test)/len(test_labels))

Training error: 0.0116
Test error: 0.184

Words with large influence

Which words are most important in deciding whether a sentence is positive? As a first approximation to this, we simply take the words whose coefficients in w have the largest positive values.

Likewise, we look at the words whose coefficients in w have the most negative values, and we think of these as influential in negative predictions.

## Convert vocabulary into a list
vocab = np.array([z[0] for z in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1])])

## Get indices that sort w
inds = np.argsort(w)

## Words with large negative values
neg_inds = inds[0:50]
print("Highly negative words: ")
print([str(x) for x in list(vocab[neg_inds])])

## Words with large positive values
pos_inds = inds[-49:-1]
print("Highly positive words: ")
print([str(x) for x in list(vocab[pos_inds])])


Create a Word Cloud

from wordcloud import WordCloud

wc = WordCloud(stopwords=stop_set, background_color="white", colormap="Dark2",
               max_font_size=150, random_state=42)

# plt.rcParams['figure.figsize'] = [16, 6]
wc.generate(" ".join(list(vocab[neg_inds])))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()


Naive Bayes

from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB().fit(train_data, train_labels)
nb_preds_test = nb_clf.predict(test_data)
nb_errs_test = np.sum((nb_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(nb_errs_test)/len(test_labels))

Test error: 0.174

Let’s do some prediction cases. [1] means positive and [-1] means negative

print(nb_clf.predict(vectorizer.transform(["It's a sad movie but very good"])))   # [1]
print(nb_clf.predict(vectorizer.transform(["Waste of my time"])))                 # [-1]
print(nb_clf.predict(vectorizer.transform(["It is not what like"])))              # [-1]
print(nb_clf.predict(vectorizer.transform(["It is not what I m looking for"])))   # [1]

The last test case has a problem: it should be a negative comment, but the model predicts positive.

SVM

from sklearn.linear_model import SGDClassifier

svm_clf = SGDClassifier(loss="hinge", penalty='l2')
svm_clf.fit(train_data, train_labels)

svm_preds_test = svm_clf.predict(test_data)
svm_errs_test = np.sum((svm_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(svm_errs_test)/len(test_labels))

Test error: 0.2

Again, let's do some predictions.

print(svm_clf.predict(vectorizer.transform(["This is not what I like"])))          # [-1]
print(svm_clf.predict(vectorizer.transform(["It is not what I am looking for"])))  # [-1]
print(svm_clf.predict(vectorizer.transform(["I would not recommend this movie"]))) # [1]

The SVM predicts the comment 'It is not what I am looking for' correctly. However, it could not correctly classify 'I would not recommend this movie'.

LSTM networks

A detailed discussion about LSTM networks can be found here.

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import SpatialDropout1D
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping

max_review_length = 200

tokenizer = Tokenizer(num_words=10000,  # max no. of unique words to keep
                      filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                      lower=True  # convert to lower case
                     )
tokenizer.fit_on_texts(sents_processed)

Truncate and pad the input sequences so that they are all the same length.

X = tokenizer.texts_to_sequences(sents_processed)
X = sequence.pad_sequences(X, maxlen= max_review_length)
print('Shape of data tensor:', X.shape)
Shape of data tensor: (3000, 200)

Recall that y is a vector of 1s and -1s. Now I change it to a matrix with two columns that represent -1 and 1.

import pandas as pd
Y=pd.get_dummies(y).values
Y


np.random.seed(0)

test_inds = np.append(np.random.choice((np.where(y == -1))[0], 250, replace=False),
                      np.random.choice((np.where(y == 1))[0], 250, replace=False))
train_inds = list(set(range(len(labels))) - set(test_inds))

train_data = X[train_inds,]
train_labels = Y[train_inds]

test_data = X[test_inds,]
test_labels = Y[test_inds]

Create the network

EMBEDDING_DIM = 200

model = Sequential()
model.add(Embedding(10000, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(250, dropout=0.2, return_sequences=True))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


epochs = 2
batch_size = 40

model.fit(train_data, train_labels,
          epochs=epochs,
          batch_size=batch_size,
          validation_split=0.1)

loss, acc = model.evaluate(test_data, test_labels, verbose=2,
                           batch_size=batch_size)
print(f"loss: {loss}")
print(f"Test accuracy: {acc}")

The LSTM performs the best out of all the models trained so far, i.e. logistic regression, Naive Bayes and SVM. Now let's see how it predicts some test cases.

outcome_labels = ['Negative', 'Positive']

new = ["I would not recommend this movie"]
seq = tokenizer.texts_to_sequences(new)
padded = sequence.pad_sequences(seq, maxlen=max_review_length)

pred = model.predict(padded)
print("Probability distribution: ", pred)
print("Is this a Positive or Negative review? ")
print(outcome_labels[np.argmax(pred)])

new = ["It is not what i am looking for"]
new = ["This isn't what i am looking for"]

For this case, the difference between the negative and positive probabilities is not large, and the LSTM model classifies it as positive.

new = ["I wouldn't recommend this movie"]

The same happens for this comment. This means that our model cannot distinguish between n't and not. One possible solution would be, in the pre-processing step, to expand the n't short form into not instead of simply removing all punctuation. This can be done with the re module in Python. You can try it yourself to see how the models' predictions improve.
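
Here is a minimal sketch of that idea with re (the expand_nt helper and its patterns are just one possible illustration, not a definitive implementation; "won't" and "can't" are handled separately because they do not split cleanly):

import re

def expand_nt(text):
    # Expand the n't contraction so "isn't" / "wouldn't" become "is not" / "would not"
    text = re.sub(r"\bwon't\b", "will not", text, flags=re.IGNORECASE)
    text = re.sub(r"\bcan't\b", "cannot", text, flags=re.IGNORECASE)
    return re.sub(r"n't\b", " not", text, flags=re.IGNORECASE)

print(expand_nt("I wouldn't recommend this movie"))   # I would not recommend this movie
print(expand_nt("This isn't what I am looking for"))  # This is not what I am looking for

Applying something like this before stripping punctuation would let both the bag-of-words models and the LSTM tokenizer see not as a token of its own.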

That is it! I hope you enjoyed this article and picked up something from it. If you have any questions, feel free to put them in the comment section below. Thank you for reading. Have a great day and take care, everyone!


Photo by Lucas Clara on Unsplash
