Sentiment Classification of Movie Reviews with scikit-learn

 2 years ago
source link: http://yphuang.github.io/blog/2016/04/21/Sentiment-Analysis-Using-sklearn/

Download the corpus from the Movie Review Data website. This walkthrough uses polarity dataset v2.0, which contains 1000 positive and 1000 negative movie reviews (pos/neg).

Next, load the data and split it into a training set and a test set.

# load library
import os 
import sys

# set working directory
os.chdir("D:\\my_python_workfile\\Thesis\\movie_review\\review_polarity\\txt_sentoken")

dataset_dir_name = os.getcwd()
dataset_dir_name
'D:\\my_python_workfile\\Thesis\\movie_review\\review_polarity\\txt_sentoken'
# load libraries
import numpy as np
from sklearn.datasets import load_files
# train_test_split and GridSearchCV live in sklearn.model_selection since scikit-learn 0.18;
# the old sklearn.cross_validation and sklearn.grid_search modules were removed in 0.20
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# load data, and split into training/test sets
movie_reviews = load_files(dataset_dir_name)
# hold out 20% of the reviews for testing
doc_terms_train, doc_terms_test, doc_class_train, doc_class_test = train_test_split(
    movie_reviews.data, movie_reviews.target, test_size=0.2, random_state=None)

#print("\n".join(movie_reviews.data[0].split("\n"))[:20])
len(doc_class_train), len(doc_class_test), movie_reviews.target_names
(1600, 400, ['neg', 'pos'])
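load_files expects one subdirectory per class (here neg/ and pos/ under txt_sentoken) and numbers the classes by sorted folder name, which is why target_names comes out as ['neg', 'pos']. The following is a minimal standard-library sketch of that convention, not scikit-learn's implementation. The helper name load_labeled_dir and the two tiny review files are made up for illustration.

```python
import os
import tempfile


def load_labeled_dir(root):
    """Read every file under root/<class_name>/ and label it by its folder."""
    # Class names are the subfolder names in sorted order: 'neg' -> 0, 'pos' -> 1.
    target_names = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    data, target = [], []
    for label, name in enumerate(target_names):
        folder = os.path.join(root, name)
        for filename in sorted(os.listdir(folder)):
            with open(os.path.join(folder, filename), encoding="utf-8") as fh:
                data.append(fh.read())
            target.append(label)
    return data, target, target_names


# Build a tiny hypothetical corpus with the same layout as txt_sentoken.
demo_root = tempfile.mkdtemp()
samples = {"neg": ["a dull , lifeless film"], "pos": ["a warm and funny film"]}
for class_name, texts in samples.items():
    os.makedirs(os.path.join(demo_root, class_name))
    for i, text in enumerate(texts):
        path = os.path.join(demo_root, class_name, "review_%d.txt" % i)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)

data, target, target_names = load_labeled_dir(demo_root)
print(target_names, target)
```

Unlike this sketch, load_files also shuffles the documents by default and returns raw bytes unless an encoding is passed.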

Build a vectorizer/classifier pipeline

# build a vectorizer/classifier pipeline
pipeline = Pipeline([
        ('vect', TfidfVectorizer(min_df=3, max_df=0.95)),
        ('clf', LinearSVC(C=1000)),
    ])
# grid search
parameters = {
        'vect__ngram_range': [(1, 1), (1, 2)],
    }
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
grid_search.fit(doc_terms_train,doc_class_train)

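# grid_scores_ was removed in scikit-learn 0.20; with sklearn.model_selection.GridSearchCV
# the same per-setting scores are available in cv_results_
results = grid_search.cv_results_
for mean, std, params in zip(results['mean_test_score'],
                             results['std_test_score'],
                             results['params']):
    print("mean: %.5f, std: %.5f, params: %r" % (mean, std, params))
mean: 0.83750, std: 0.01659, params: {'vect__ngram_range': (1, 1)}
mean: 0.85938, std: 0.01338, params: {'vect__ngram_range': (1, 2)}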
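The grid search compares two values of ngram_range: (1, 1) uses single words only, while (1, 2) also adds pairs of adjacent words, and the second setting scored higher in the cross-validation results above. One plausible reason is that bigrams keep short phrases such as negations together. The sketch below shows what the two ranges produce for a token list. The function word_ngrams is a hypothetical helper, and whitespace splitting is a simplification of TfidfVectorizer's real tokenizer.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams whose length lies within ngram_range (inclusive)."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "not good at all".split()
unigrams = word_ngrams(tokens, (1, 1))
uni_and_bigrams = word_ngrams(tokens, (1, 2))
print(unigrams)          # ['not', 'good', 'at', 'all']
print(uni_and_bigrams)   # ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']
```

With unigrams alone, "good" appears as a feature on its own. The bigram "not good" gives the classifier a separate feature for the negated phrase.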

Evaluate the model's predictions

# y_predicted
y_predicted = grid_search.predict(doc_terms_test)

# report
print(metrics.classification_report(doc_class_test,y_predicted,
                                   target_names = movie_reviews.target_names))
             precision    recall  f1-score   support

        neg       0.85      0.85      0.85       188
        pos       0.87      0.86      0.87       212

avg / total       0.86      0.86      0.86       400
# confusion matrix
confusion_matrix = metrics.confusion_matrix(doc_class_test,y_predicted)
print(confusion_matrix)
[[160  28]
 [ 29 183]]
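The figures in the classification report can be recomputed from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below does this by hand in plain Python as a cross-check. The variable names are illustrative and are not part of the original script.

```python
# Confusion matrix from the run above: rows are true classes, columns are predictions.
matrix = [[160, 28],
          [29, 183]]
names = ["neg", "pos"]

report = {}
for k, name in enumerate(names):
    true_positive = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)   # column sum: everything predicted as k
    actual_k = sum(matrix[k])                     # row sum: the support of class k
    precision = true_positive / predicted_k
    recall = true_positive / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (round(precision, 2), round(recall, 2), round(f1, 2), actual_k)

# Accuracy is the diagonal (correct predictions) over all test documents.
accuracy = sum(matrix[k][k] for k in range(len(names))) / sum(map(sum, matrix))

for name, row in report.items():
    print(name, row)
print("accuracy", accuracy)
```

The rounded values match the report: 0.85/0.85/0.85 for neg (support 188) and 0.87/0.86/0.87 for pos (support 212), with an overall accuracy of 343/400.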

That concludes this introductory walkthrough.

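In practice, of course, a positive/negative split of movie reviews is often not enough, and a finer-grained classification is worth considering. Suppose reviews are rated from 1 to 5 stars: a 1-star review is more similar to a 2-star review than to a 5-star review, so this multi-class problem can be treated as an ordinal regression problem (see Pang B. et al. in the references).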
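To see why the ordering of the star labels matters, compare a metric that ignores it with one that respects it. The sketch below uses made-up ratings and two hypothetical sets of predictions. It only illustrates the motivation for an ordinal treatment and is not the method of Pang et al.

```python
def zero_one_error(y_true, y_pred):
    """Fraction of predictions that are not exactly right (ignores how far off)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average distance in stars between prediction and truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


true_stars = [1, 1, 4, 5]      # hypothetical star ratings
near_misses = [2, 2, 5, 4]     # always off by exactly one star
far_misses = [5, 5, 1, 1]      # confuses the two ends of the scale

for name, pred in [("near", near_misses), ("far", far_misses)]:
    print(name, zero_one_error(true_stars, pred), mean_absolute_error(true_stars, pred))
```

Both sets of predictions are wrong every time, so plain classification error rates them equally at 1.0. The absolute error separates them (1.0 star versus 3.75 stars), which is the kind of distinction an ordinal formulation can exploit.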

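Kaggle happens to host a finer-grained sentiment classification task: Sentiment Analysis on Movie Reviews. Readers interested in sentiment analysis can roll up their sleeves and try the competition.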


