基于深度学习的文本分类案例：使用LSTM进行情绪分类 - 孤飞 - JOYK Joy of Geek, Geek News, Link all geek

Sentiment classification using LSTM

在这个笔记本中，我们将使用LSTM架构在电影评论数据集上训练一个模型来预测评论的情绪。首先，让我们看看什么是LSTM？

LSTM，即长短时记忆，是一种序列神经网络架构，它利用其结构保留了对前一序列的记忆。第一个被引入的序列模型是RNN。但是，很快研究人员发现，RNN并没有保留很多以前序列的记忆。这导致在长文本序列中失去上下文。

为了维护这一背景，LSTM被引入。在LSTM单元中，有一些特殊的结构被称为门和单元状态，它们被改变和维护以保持LSTM中的记忆。要了解这些结构如何工作，请阅读 this blog.

从代码上看，我们正在使用tensorflow和keras来建立模型和训练它。为了进一步了解本项目的代码/概念，我们使用了以下参考资料。

References:

(1) Medium article on keras lstm

(2) Keras embedding layer documentation

(3) Keras example of text classification from scratch

(4) Bi-directional lstm model example

(5) kaggle notebook for text preprocessing

Notebook:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sentiment-analysis-on-movie-reviews/sampleSubmission.csv
/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip
/kaggle/input/sentiment-analysis-on-movie-reviews/test.tsv.zip

1
2
3
train_data = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip',sep = '\t')
test_data = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip',sep = '\t')
train_data.head()

PhraseId	SentenceId	Phrase	Sentiment
0	1	1	A series of escapades demonstrating the adage ...	1
1	2	1	A series of escapades demonstrating the adage ...	2
2	3	1	A series	2
3	4	1	A	2
4	5	1	series	2

1
2
train_data = train_data.drop(['PhraseId','SentenceId'],axis = 1)
test_data = test_data.drop(['PhraseId','SentenceId'],axis = 1)

1
2
3
4
5
6
7
import keras
from keras.models import Sequential
from keras.layers import Dense #层lyer
from keras.layers import LSTM
from keras.layers import Activation
from keras.layers import Embedding
from keras.layers import Bidirectional

1
2
max_features = 20000  # 只考虑前20千字
maxlen = 200

train_data.head()

Phrase	Sentiment
0	A series of escapades demonstrating the adage ...	1
1	A series of escapades demonstrating the adage ...	2
2	A series	2
3	A	2
4	series	2

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from nltk.corpus import stopwords
import re
# 定义文本清理函数
def text_cleaning(text):
    forbidden_words = set(stopwords.words('english'))#停用词，对于理解文章没有太大意义的词，比如"the"、“an”、“his”、“their”
    if text:
        text = ' '.join(text.split('.'))
        text = re.sub('\/',' ',text)
        text = re.sub(r'\\',' ',text)
        text = re.sub(r'((http)\S+)','',text)
        text = re.sub(r'\s+', ' ', re.sub('[^A-Za-z]', ' ', text.strip().lower())).strip()
        text = re.sub(r'\W+', ' ', text.strip().lower()).strip()
        text = [word for word in text.split() if word not in forbidden_words]
        return text
    return []

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# 将句子转化为词语列表
train_data['flag'] = 'TRAIN'
test_data['flag'] = 'TEST'
total_docs = pd.concat([train_data,test_data],axis = 0,ignore_index = True)
total_docs['Phrase'] = total_docs['Phrase'].apply(lambda x: ' '.join(text_cleaning(x)))
phrases = total_docs['Phrase'].tolist()
from keras.preprocessing.text import one_hot
vocab_size = 50000
encoded_phrases = [one_hot(d, vocab_size) for d in phrases]
total_docs['Phrase'] = encoded_phrases
train_data = total_docs[total_docs['flag'] == 'TRAIN']
test_data = total_docs[total_docs['flag'] == 'TEST']
x_train = train_data['Phrase']
y_train = train_data['Sentiment']
x_val = test_data['Phrase']
y_val = test_data['Sentiment']

x_train.head()

y_train.unique()

array([1, 2, 3, 4, 0])

tf.keras.preprocessing.sequence.pad_sequences()的用法：https://blog.csdn.net/qq_45465526/article/details/109400926)

1
2
3
# 将序列转化为经过填充以后得到的一个长度相同新的序列
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)

1
2
3
4
5
6
7
8
9
10
11
12
model = Sequential()
inputs = keras.Input(shape=(None,), dtype="int32")
# 将每个整数嵌入一个128维的向量中
model.add(inputs)
model.add(Embedding(50000, 128))
# 增加2个双向的LSTM
model.add(Bidirectional(LSTM(64, return_sequences=True)))
model.add(Bidirectional(LSTM(64)))
# 添加一个分类器
model.add(Dense(5, activation="sigmoid"))
#model = keras.Model(inputs, outputs)
model.summary()

result:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 128)         6400000   
_________________________________________________________________
bidirectional (Bidirectional (None, None, 128)         98816     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 5)                 645       
=================================================================
Total params: 6,598,277
Trainable params: 6,598,277
Non-trainable params: 0
_________________________________________________________________

1
2
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=30, validation_data=(x_val, y_val))

result:

1
2
3
4
5
6
7
8
9
Epoch 1/30
4877/4877 [==============================] - 562s 115ms/step - loss: 0.9593 - accuracy: 0.6107 - val_loss: 0.7819 - val_accuracy: 0.6798
Epoch 2/30
4877/4877 [==============================] - 520s 107ms/step - loss: 0.7942 - accuracy: 0.6729 - val_loss: 0.7094 - val_accuracy: 0.7114
.....................................................................
Epoch 29/30
4877/4877 [==============================] - 539s 111ms/step - loss: 0.3510 - accuracy: 0.8117 - val_loss: 0.3220 - val_accuracy: 0.8242
Epoch 30/30
4877/4877 [==============================] - 553s 113ms/step - loss: 0.3485 - accuracy: 0.8124 - val_loss: 0.3187 - val_accuracy: 0.8238

<tensorflow.python.keras.callbacks.History at 0x7fa9b82520d0>

model.fit(x_train, y_train, batch_size=32, epochs=5, validation_data=(x_val, y_val))

result:

1
2
3
4
5
6
7
8
9
10
Epoch 1/5
4877/4877 [==============================] - 535s 110ms/step - loss: 0.3477 - accuracy: 0.8128 - val_loss: 0.3193 - val_accuracy: 0.8240
Epoch 2/5
4877/4877 [==============================] - 543s 111ms/step - loss: 0.3457 - accuracy: 0.8134 - val_loss: 0.3173 - val_accuracy: 0.8250
Epoch 3/5
4877/4877 [==============================] - 542s 111ms/step - loss: 0.3428 - accuracy: 0.8140 - val_loss: 0.3158 - val_accuracy: 0.8254
Epoch 4/5
4877/4877 [==============================] - 541s 111ms/step - loss: 0.3429 - accuracy: 0.8144 - val_loss: 0.3165 - val_accuracy: 0.8257
Epoch 5/5
4877/4877 [==============================] - 557s 114ms/step - loss: 0.3395 - accuracy: 0.8150 - val_loss: 0.3136 - val_accuracy: 0.8259

<tensorflow.python.keras.callbacks.History at 0x7fa8e0763150>

总之，我们创建了一个双向的LSTM模型，并对其进行了检测情感的训练。我们达到了80%的训练和82%的验证准确率。
Notebook code：https://www.kaggle.com/code/ranxi169/sentiment-classification-using-lstm/notebook
原创作者：孤飞-博客园
个人博客：https://blog.onefly.top

基于深度学习的文本分类案例：使用LSTM进行情绪分类 - 孤飞

Sentiment classification using LSTM

References:

Notebook:

Recommend

Tips To Help You Master the Retro Web Design Trend

Apple issues iOS 16.0.2 update to fix camera, hard reset bugs

Denialists, Alarmists, and Doomists

割经销商韭菜，又被经销商所害：这个行业，风险到底有多大？

Security Capital首席执行官：加密市场在美联储加息前提前调整，其波动性高于股票

CloudEats获得700万美元融资

LXC and LXD: a different container story

司马徒林：电子古董才是数字时代的特效药|史蒂夫·乔布斯|复古|diy|个人电脑_网易订阅

一诺威收北交所三轮审核问询函，税务合规性等受关注

HistoryDAO：去中心化《史记》

About Joyk