
Doing your first sentiment analysis in R with Sentimentr

 4 years ago
source link: https://www.tuicool.com/articles/A7rUn2r

The Sentimentr package for R is very useful for analyzing text in psychological or sociological studies. Its first big advantage is that it makes sentiment analysis achievable within a few lines of code. Its second big advantage is that it corrects for negation: a more basic, word-counting sentiment analysis would judge “I am not good” as positive because of the adjective good, whereas Sentimentr recognizes that “not” inverts good and classifies the sentence as negative.
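As a minimal sketch of this behavior, the snippet below scores two toy sentences of my own that differ only in the word “not”. It assumes the sentimentr package is already installed. The exact scores depend on the dictionary version, so none are quoted here; only the sign matters.

```r
library(sentimentr)

# Two toy sentences (illustrative, not from the dataset) that differ only by "not"
examples <- c("I am good.", "I am not good.")

# sentiment() returns one row per sentence with a `sentiment` column
scores <- sentiment(get_sentences(examples))
print(scores)

# The sign of each score gives the polarity of the sentence
sign(scores$sentiment)
```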

All in all, Sentimentr allows you to quickly do a sophisticated sentiment analysis and directly use it as an input for your regression or any other further analysis.

This article covers how to get started. If you are looking for more advanced techniques, please refer to other resources such as the README of Tyler Rinker’s GitHub repository. For this tutorial, I will be analyzing Amazon reviews of beauty products from the He & McAuley (2016) dataset. However, you can easily adapt the code to your own dataset.


Photo by Obi Onyeador on Unsplash

By default, Sentimentr uses a polarity dictionary based on Jockers (2017), which should work well for most general-purpose text.

Installing the packages and loading the data

install.packages("sentimentr")
library(sentimentr)

The first two commands install and load the Sentimentr package. Next, I load the data. Because the file is in JSON format with one record per line, I use the ndjson package, whose stream_in function reads the Amazon beauty data into a table.

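install.packages("ndjson")
library(ndjson)
# Read the newline-delimited JSON file; adjust the path to your copy of the data
df <- stream_in("AmazonBeauty.json")
# Preview the first rows
head(df)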

I also used the head function to quickly look at the first couple of rows of the data. As you will be able to see when performing this on your own machine, there is a column called reviewText that contains the reviews.
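Before scoring, it can be worth checking that every row actually contains review text. The sketch below is an optional precaution and not part of the original workflow; it uses a small made-up data frame in place of df, and only the column name reviewText is taken from the dataset.

```r
# Made-up stand-in for df; only the column name reviewText comes from the dataset
reviews <- data.frame(
  reviewText = c("Great lipstick.", NA, "", "Too greasy for me."),
  stringsAsFactors = FALSE
)

# TRUE where the review is neither missing nor empty after trimming whitespace
has_text <- !is.na(reviews$reviewText) & nzchar(trimws(reviews$reviewText))

# Count usable and unusable rows, then keep only the usable ones
table(has_text)
reviews <- reviews[has_text, , drop = FALSE]
nrow(reviews)
```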

Doing the actual sentiment analysis

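# Split each review into sentences first, then average the sentence scores per review
sentiment <- sentiment_by(get_sentences(df$reviewText))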

This command runs the sentiment analysis. Here I used the sentiment_by function to get one aggregate sentiment score per review. Alternatively, you could use the sentiment function (without _by) to get a score for each sentence.
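To make the difference between the two functions concrete, here is a small sketch using two made-up reviews of two sentences each. The review texts and variable names are my own assumptions for illustration; the functions are the ones provided by sentimentr.

```r
library(sentimentr)

# Two made-up reviews, two sentences each (illustrative data, not from the dataset)
toy_reviews <- c(
  "I love this cream. It smells great.",
  "The bottle leaked. I want a refund."
)
sentences <- get_sentences(toy_reviews)

per_sentence <- sentiment(sentences)     # one row per sentence
per_review   <- sentiment_by(sentences)  # one row per review

nrow(per_sentence)
nrow(per_review)
```

Sentence-level scores are useful when a single review mixes praise and criticism, while the per-review average is easier to attach to one row of the dataset.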

While this command runs (it does take a while), I will discuss what the function will return. The sentiment object in this example will be a data.table including the following columns:

  • element_id — The id number of the review
  • word_count — The word count of the review
  • sd — The standard deviation of the sentiment score of the sentences in the review
  • ave_sentiment — The average sentiment score of the sentences in the review

The most interesting variable is ave_sentiment, which summarizes the sentiment of the review in one number. The sign gives the polarity (positive or negative) and the absolute value gives the strength of the sentiment.
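If you need categories rather than a continuous score, one option is to bin the values. The sketch below uses made-up scores, and the cutoffs of -0.05 and 0.05 are arbitrary assumptions of mine, not values recommended by the package.

```r
# Made-up scores standing in for sentiment$ave_sentiment
ave_sentiment <- c(-0.45, -0.02, 0, 0.10, 0.80)

# Arbitrary cutoffs (an assumption): anything within 0.05 of zero counts as neutral
label <- cut(
  ave_sentiment,
  breaks = c(-Inf, -0.05, 0.05, Inf),
  labels = c("negative", "neutral", "positive")
)

data.frame(ave_sentiment, label)
```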

Analyzing the sentiment scores

We can look at some summary statistics of the calculated sentiment scores.

summary(sentiment$ave_sentiment)

As you can see, most of the reviews tend to be moderately positive, but there are some extreme outliers: the most positive review scores 3.44 and the most negative scores -1.88. These values are far from the mean and the median, so you should consider removing them before any further analysis.
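One possible way to flag such outliers is Tukey's interquartile-range rule, sketched below in base R. This is only one of several reasonable rules, and the vector of scores is made up for illustration (apart from the two extremes quoted above); in practice you would use sentiment$ave_sentiment.

```r
# Made-up scores standing in for sentiment$ave_sentiment
scores <- c(0.10, 0.25, 0.30, 0.15, 0.40, 0.20, 3.44, -1.88)

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q <- unname(quantile(scores, probs = c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
keep <- scores >= q[1] - fence & scores <= q[2] + fence

# Scores that survive the filter
scores[keep]
```

The same logical vector can be used to subset the rows of the full table, so the filtered scores stay aligned with their reviews.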

I also did a quick histogram to look at the sentiment of the reviews.

library(ggplot2)
# ggplot() replaces qplot(), which is deprecated in recent ggplot2 releases
ggplot(sentiment, aes(x = ave_sentiment)) +
  geom_histogram(binwidth = 0.1) + ggtitle("Review Sentiment Histogram")

[Figure: Review Sentiment Histogram, a histogram of the ave_sentiment scores]

Integrating your sentiment scores into the original dataset

As I am most interested in the sentiment scores, I will conclude this tutorial by integrating the sentiment scores and their standard deviation back into the main dataset.

# sentiment_by() returns one row per review in input order, so the rows line up with df
df$ave_sentiment <- sentiment$ave_sentiment
df$sd_sentiment <- sentiment$sd
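Once the scores sit in the main dataset, they can be used like any other variable, for example as a predictor in a regression. The sketch below uses a toy data frame; the column name overall (the star rating) and all values are assumptions for illustration, so check the column names in your own data.

```r
# Toy stand-in for the merged data; `overall` is an assumed name for the star rating
reviews <- data.frame(
  overall       = c(5, 4, 1, 2, 5, 3),
  ave_sentiment = c(0.60, 0.35, -0.40, -0.10, 0.55, 0.05)
)

# Simple linear regression of the rating on the average sentiment score
fit <- lm(overall ~ ave_sentiment, data = reviews)
coef(fit)
```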

I hope this helped.

