Doing your first sentiment analysis in R with Sentimentr

The Sentimentr package for R is immensely helpful when it comes to analyzing text for psychological or sociological studies. Its first big advantage is that it makes sentiment analysis simple and achievable within a few lines of code. Its second big advantage is that it corrects for inversions (negations): while a more basic sentiment analysis would judge “I am not good” as positive because of the adjective “good”, Sentimentr recognizes that “not” inverts “good” and scores the sentence as negative.
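To make the inversion point concrete, here is a minimal sketch that scores the two sentences side by side. The two example strings are made up for illustration; the only functions used are get_sentences and sentiment from the package.

```r
library(sentimentr)

# Two toy sentences (illustrative only): one plain, one negated
texts <- c("I am good.", "I am not good.")

# Split into sentences first, then score each sentence
scores <- sentiment(get_sentences(texts))

# One row per sentence; the sign of the score carries the polarity
print(scores)

plain_score   <- scores$sentiment[scores$element_id == 1]
negated_score <- scores$sentiment[scores$element_id == 2]
```

With the default dictionary, the plain sentence should come out positive and the negated one negative, which is the behavior described above.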

All in all, Sentimentr allows you to quickly do a sophisticated sentiment analysis and directly use it as an input for your regression or any other further analysis.

This article covers how to get started. If you are looking for more advanced techniques, please refer to other resources such as the README of Tyler Rinker’s GitHub repo. For the purpose of this tutorial, I will be analyzing Amazon reviews of beauty products from the He & McAuley (2016) dataset. However, you can easily adapt the code to fit your own dataset.

Photo by Obi Onyeador on Unsplash

By default, Sentimentr uses a polarity dictionary based on Jockers (2017), which should be suitable for most use cases.

Installing the packages and loading the data

install.packages("sentimentr")
library(sentimentr)

The first two commands install and load the Sentimentr package. Next, I am loading the data. As it is in JSON format, I need to load the ndjson package. I can then use the package’s stream_in function to load the Amazon Beauty Data.

install.packages("ndjson")
library(ndjson)
df = stream_in("AmazonBeauty.json")
head(df)

I also used the head function to quickly look at the first couple of rows of the data. As you will be able to see when performing this on your own machine, there is a column called reviewText that contains the reviews.
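If you do not have the Amazon file at hand, the loading step can be tried on a tiny stand-in. The sketch below writes two made-up reviews to a temporary newline-delimited JSON file and reads them back with stream_in; the file contents and the name toy are assumptions for illustration, not part of the real dataset.

```r
library(ndjson)

# Write two made-up reviews as newline-delimited JSON (one object per line)
tmp <- tempfile(fileext = ".json")
writeLines(c(
  '{"reviewerID": "A1", "overall": 5, "reviewText": "Great lotion. I love it."}',
  '{"reviewerID": "A2", "overall": 1, "reviewText": "This did not work for me."}'
), tmp)

# stream_in reads the file into a table with one row per JSON object
toy <- stream_in(tmp)

# Inspect the columns; reviewText holds the text to analyze
print(names(toy))
head(toy)
```

The same two calls (stream_in, then head) apply unchanged to the full review file.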

Doing the actual sentiment analysis

sentiment=sentiment_by(df$reviewText)

This command runs the sentiment analysis. In this case, I used the sentiment_by function to get one aggregate sentiment measure per review. In other cases, you could use the sentiment function (without _by) to get the sentiment of each individual sentence.
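The difference between the two functions is easiest to see on a small input. The following sketch uses two made-up reviews and splits them into sentences once with get_sentences, so that both functions work from the same sentence boundaries.

```r
library(sentimentr)

# Two toy reviews (illustrative only); the first contains two sentences
reviews <- c("Great lotion. I love it.", "This did not work for me.")

# Split into sentences once and reuse the result
sentences <- get_sentences(reviews)

# sentiment(): one row per sentence
per_sentence <- sentiment(sentences)

# sentiment_by(): one row per review, averaging over its sentences
per_review <- sentiment_by(sentences)

print(per_sentence)
print(per_review)
```

Splitting the text up front is also what the package suggests for larger inputs, since it avoids repeating the sentence splitting on every call.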

While this command runs (it does take a while), I will discuss what the function will return. The sentiment object in this example will be a data.table including the following columns:

element_id — The id number of the review
word_count — The word count of the review
sd — The standard deviation of the sentiment score of the sentences in the review
ave_sentiment — The average sentiment score of the sentences in the review

The most interesting variable is ave_sentiment, which summarizes the sentiment of the review in one number. Its sign gives the polarity (positive or negative), and its magnitude indicates how strong the sentiment is.

Analyzing the Sentiment Scores

We can look at some summary statistics of the calculated sentiment scores.

summary(sentiment$ave_sentiment)

As you can see, most of the reviews tend to be moderately positive, but there are some extreme outliers with the most positive review being 3.44 and the most negative being -1.88. These are quite far away from the mean and the median and one should consider removing them for any further analysis.
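One common way to flag such outliers is the 1.5 × IQR rule. The sketch below is only one possible choice, not part of Sentimentr, and the score vector is a made-up stand-in that reuses the two extreme values mentioned above; in practice you would apply the same steps to sentiment$ave_sentiment.

```r
# Made-up scores for illustration, including the two extremes from the text
scores <- c(0.12, 0.25, 0.31, 0.18, 3.44, 0.22, -1.88, 0.27)

# Quartiles and interquartile range
q <- unname(quantile(scores, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Keep values within 1.5 * IQR of the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
keep <- scores >= lower & scores <= upper

trimmed <- scores[keep]
print(trimmed)
```

Whether to drop, cap, or keep extreme reviews depends on the later analysis, so treat the threshold as an assumption to justify rather than a default.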

I also did a quick histogram to look at the sentiment of the reviews.

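library(ggplot2)
# qplot() is deprecated in recent ggplot2 versions, so this uses ggplot() directly
ggplot(sentiment, aes(x = ave_sentiment)) + geom_histogram(binwidth = 0.1) + ggtitle("Review Sentiment Histogram")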

[Figure: Review Sentiment Histogram]

Integrating your sentiment scores into the original dataset

As I am most interested in the sentiment scores, I will conclude this tutorial by integrating the sentiment scores and their standard deviation back into the main dataset.

df$ave_sentiment=sentiment$ave_sentiment
df$sd_sentiment=sentiment$sd
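Assigning the columns directly relies on the scores being in the same order as the rows of the dataset. A minimal sketch of the same step with an explicit alignment check, using a made-up two-row data frame named reviews in place of the real data:

```r
library(sentimentr)

# Toy stand-in for the review dataset (illustrative only)
reviews <- data.frame(
  reviewText = c("Great lotion. I love it.", "This did not work for me."),
  stringsAsFactors = FALSE
)

scores <- sentiment_by(get_sentences(reviews$reviewText))

# element_id is the position of each text in the input, so it should
# run 1..n in the same order as the rows of the data frame
stopifnot(nrow(scores) == nrow(reviews))
stopifnot(all(scores$element_id == seq_len(nrow(reviews))))

# Attach the average sentiment and its standard deviation
reviews$ave_sentiment <- scores$ave_sentiment
reviews$sd_sentiment <- scores$sd

print(reviews)
```

Note that the standard deviation is computed across the sentences of a review, so reviews with a single sentence will typically have a missing value in sd_sentiment.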

I hope this helped.
