7

词袋模型

 2 years ago
source link: https://ylhao.github.io/2018/05/17/193/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

词袋模型要点

词袋模型不考虑语法、语序等要素,仅仅将文本看作是若干词的集合,文本中每个词的出现是相互独立的。

如何得到词袋模型

  1. 根据所有文档得到词典
  2. 每个文档可以得到一个与词典维数相同的向量,可以用稀疏矩阵表示
# encoding: utf-8

import numpy as np
import pandas as pd
import math
from gensim import corpora, models, logging


class my_corpora:

    def __init__(self):
        self.docs = []  # 每个元素是要一个字典,每个字典对应的每篇文章中每个词出现的次数
        self.bows = []
        self.word2idx = {}  # 键为词,值为每个词对应的索引
        self.idx2word = {}  # 键为索引,值为词
        self.dct = {}  # 键为索引,值为词
        self.idx = 0  # 当前索引

    def dictionary(self, docs):
        for doc in docs:
            for word in doc:
                if word not in self.word2idx:
                    self.word2idx[word] = self.idx
                    self.idx2word[self.idx] = word
                    self.idx += 1

    def doc2bow(self, docs):

        # 统计每篇文章中每个词出现的次数
        for title in docs:
            tmp = {}
            for word in title:
                tmp[self.word2idx[word]] = tmp.get(self.word2idx[word], 0) + 1
            self.docs.append(tmp)

        # 得到词袋模型
        for doc in self.docs:
            tmp = []
            for k in doc:
                tmp.append((k, doc[k]))
            self.bows.append(tmp)

        # check
        for bow in self.bows:
            print(bow)
        print()

if __name__ == '__main__':

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # gensim 日志

    test_df = pd.read_csv('./test_words_clean.csv', sep=',', encoding='utf-8', skiprows=[0], header=None, names=['id', 'title'])
    test_df['title'] = test_df['title'].apply(lambda x: x.split())

    # gensim
    dct = corpora.Dictionary(test_df['title'])
    bows = np.array([dct.doc2bow(title) for title in test_df['title']])
    for bow in bows:
        print(bow)
    print()

    # 使用自定义的类实现
    corpora_test = my_corpora()
    corpora_test.dictionary(test_df['title'])
    corpora_test.doc2bow(test_df['title'])

文本文件 test_words_clean.csv

id,title
1,美国 副 总统 彭斯 朝鲜 问题 为 所有 可能 结果 做好 准备 任何 核武器 使用 进行 快速 应对
2,香港 财政 司长 陈茂波 需要 继续 留意 全球 货币 环境 地缘 政治 变化 政策 风险
3,日本央行 理事 雨宫 正佳 退出 宽松 细节 经济 物价 状况 决定
4,日本央行 新任 副行长 日本 存在 通缩 但是 距离 通胀 目标 距离
5,德国 地学 研究 中心 智利 北部 海岸 附近 发生 级 地震
6,美国 财长 努钦 美国 总统 特朗普 朝鲜 最高 领导人 金正恩 会面 条件 朝鲜 无核化 以及 不再 进行 导弹 测试
7,据 韩联社 韩国 总统 文在寅 美国 总统 特朗普 通电 朝鲜 可能 对话 进行 讨论
8,美联储 主席 鲍威尔 预计 通胀 于 中期 稳定
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
[(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]
[(33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)]
[(35, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1)]
[(53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1)]
[(10, 1), (12, 2), (15, 2), (16, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1)]
[(6, 1), (10, 2), (12, 1), (15, 1), (16, 1), (73, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1)]
[(52, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1)]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
[(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]
[(33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)]
[(33, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1)]
[(53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1)]
[(0, 2), (64, 1), (65, 1), (2, 1), (66, 1), (4, 2), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (15, 1), (75, 1), (76, 1)]
[(77, 1), (78, 1), (79, 1), (2, 2), (80, 1), (0, 1), (66, 1), (81, 1), (4, 1), (8, 1), (82, 1), (15, 1), (83, 1)]
[(84, 1), (85, 1), (86, 1), (87, 1), (51, 1), (88, 1), (89, 1), (90, 1)]

转载请注明来源,欢迎对文章中的引用来源进行考证,欢迎指出任何有错误或不够清晰的表达,可以在文章下方的评论区进行评论,也可以邮件至 [email protected]

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK