Bag-of-Words Model
source link: https://ylhao.github.io/2018/05/17/193/
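Key points of the bag-of-words model
The bag-of-words model ignores grammar, word order, and similar factors. It treats a text simply as a collection of words and assumes that each word occurs independently of the others.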
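A minimal sketch of this point, using only the standard library: two sentences with the same words in a different order yield the same bag. The example sentences and the `bag_of_words` helper are illustrative assumptions, not part of the original article.

```python
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated tokens; word order is discarded."""
    return Counter(text.split())


# Hypothetical example sentences (not from the article's data set).
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a == b)    # True: same words and counts, different order
print(a["the"])  # 2
```

The two sentences mean different things, yet the model cannot tell them apart, which is exactly the information it gives up.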
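How to build a bag-of-words model
- Build a dictionary from all documents.
- Represent each document as a vector with the same dimensionality as the dictionary; these vectors can be stored as a sparse matrix.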
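The two steps above can be sketched on a toy corpus before turning to the full script. The helper names (`build_dictionary`, `to_dense`, `to_sparse`) and the sample documents are assumptions made for illustration; the sparse form here is a plain list of (id, count) pairs rather than a real sparse-matrix type.

```python
def build_dictionary(docs):
    """Map each distinct word to an integer id, in order of first appearance."""
    word2idx = {}
    for doc in docs:
        for word in doc:
            if word not in word2idx:
                word2idx[word] = len(word2idx)
    return word2idx


def to_dense(doc, word2idx):
    """Count vector with one slot per dictionary entry."""
    vec = [0] * len(word2idx)
    for word in doc:
        vec[word2idx[word]] += 1
    return vec


def to_sparse(dense):
    """Keep only the non-zero entries as (id, count) pairs."""
    return [(i, c) for i, c in enumerate(dense) if c > 0]


# Hypothetical, already tokenized documents.
docs = [["inflation", "target", "inflation"], ["target", "rate"]]
word2idx = build_dictionary(docs)
dense = [to_dense(d, word2idx) for d in docs]
sparse = [to_sparse(v) for v in dense]

print(word2idx)  # {'inflation': 0, 'target': 1, 'rate': 2}
print(dense)     # [[2, 1, 0], [0, 1, 1]]
print(sparse)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

With a realistic vocabulary most slots of each dense vector are zero, which is why the sparse form is usually the one that gets stored.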
# encoding: utf-8
import logging

import pandas as pd
from gensim import corpora


class my_corpora:
    def __init__(self):
        self.docs = []      # one dict per document: word index -> count in that document
        self.bows = []      # one list of (word index, count) pairs per document
        self.word2idx = {}  # key: word, value: index of that word
        self.idx2word = {}  # key: index, value: word
        self.idx = 0        # next unused index

    def dictionary(self, docs):
        # Assign an index to every word the first time it is seen
        for doc in docs:
            for word in doc:
                if word not in self.word2idx:
                    self.word2idx[word] = self.idx
                    self.idx2word[self.idx] = word
                    self.idx += 1

    def doc2bow(self, docs):
        # Count how often each word occurs in each document
        for title in docs:
            tmp = {}
            for word in title:
                tmp[self.word2idx[word]] = tmp.get(self.word2idx[word], 0) + 1
            self.docs.append(tmp)
        # Build the bag-of-words representation
        for doc in self.docs:
            tmp = []
            for k in doc:
                tmp.append((k, doc[k]))
            self.bows.append(tmp)
        # check
        for bow in self.bows:
            print(bow)
        print()


if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # gensim logging
    test_df = pd.read_csv('./test_words_clean.csv', sep=',', encoding='utf-8', skiprows=[0], header=None, names=['id', 'title'])
    test_df['title'] = test_df['title'].apply(lambda x: x.split())

    # gensim
    dct = corpora.Dictionary(test_df['title'])
    # A plain list avoids building a ragged NumPy array, which recent
    # NumPy versions reject unless dtype=object is given
    bows = [dct.doc2bow(title) for title in test_df['title']]
    for bow in bows:
        print(bow)
    print()

    # Custom implementation
    corpora_test = my_corpora()
    corpora_test.dictionary(test_df['title'])
    corpora_test.doc2bow(test_df['title'])
The text file test_words_clean.csv:
id,title
1,美国 副 总统 彭斯 朝鲜 问题 为 所有 可能 结果 做好 准备 任何 核武器 使用 进行 快速 应对
2,香港 财政 司长 陈茂波 需要 继续 留意 全球 货币 环境 地缘 政治 变化 政策 风险
3,日本央行 理事 雨宫 正佳 退出 宽松 细节 经济 物价 状况 决定
4,日本央行 新任 副行长 日本 存在 通缩 但是 距离 通胀 目标 距离
5,德国 地学 研究 中心 智利 北部 海岸 附近 发生 级 地震
6,美国 财长 努钦 美国 总统 特朗普 朝鲜 最高 领导人 金正恩 会面 条件 朝鲜 无核化 以及 不再 进行 导弹 测试
7,据 韩联社 韩国 总统 文在寅 美国 总统 特朗普 通电 朝鲜 可能 对话 进行 讨论
8,美联储 主席 鲍威尔 预计 通胀 于 中期 稳定
Output (gensim first, then the custom class):
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
[(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]
[(33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)]
[(35, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1)]
[(53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1)]
[(10, 1), (12, 2), (15, 2), (16, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1)]
[(6, 1), (10, 2), (12, 1), (15, 1), (16, 1), (73, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1)]
[(52, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1)]
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
[(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]
[(33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)]
[(33, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1)]
[(53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1)]
[(0, 2), (64, 1), (65, 1), (2, 1), (66, 1), (4, 2), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (15, 1), (75, 1), (76, 1)]
[(77, 1), (78, 1), (79, 1), (2, 2), (80, 1), (0, 1), (66, 1), (81, 1), (4, 1), (8, 1), (82, 1), (15, 1), (83, 1)]
[(84, 1), (85, 1), (86, 1), (87, 1), (51, 1), (88, 1), (89, 1), (90, 1)]
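The sixth row of each block of output encodes the same title, but the two implementations number the words differently, so the ids do not line up. One quick sanity check, sketched below, is to compare the counts while ignoring the ids. The `count_profile` helper is a hypothetical name, and matching profiles are only a necessary condition for the two bags to agree, not a proof.

```python
def count_profile(bow):
    """Sorted list of counts, ignoring which id each count belongs to."""
    return sorted(count for _, count in bow)


# Sixth row of the first block of output.
first_row = [(10, 1), (12, 2), (15, 2), (16, 1), (64, 1), (65, 1), (66, 1),
             (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1),
             (74, 1), (75, 1), (76, 1)]
# Sixth row of the second block of output.
second_row = [(0, 2), (64, 1), (65, 1), (2, 1), (66, 1), (4, 2), (67, 1),
              (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1),
              (15, 1), (75, 1), (76, 1)]

print(count_profile(first_row) == count_profile(second_row))  # True
print(sum(count for _, count in second_row))  # 19, the number of tokens in title 6
```

Both rows contain 17 distinct words, two of which occur twice, matching the 19 tokens of title 6 in the CSV file.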
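Please credit the source when reposting. Readers are welcome to verify the sources cited in this article and to point out any errors or unclear wording, either in the comment section below the article or by email to [email protected]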