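Bag-of-Words Model

Key Points of the Bag-of-Words Model

The bag-of-words model ignores grammar and word order. It treats a text simply as an unordered collection of words, keeping only how many times each word occurs, and assumes that each word occurs independently of the others.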
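
As a small illustration of this point, the sketch below counts words with collections.Counter from the Python standard library. The two English sentences and the helper name bag_of_words are made up for this example and are not part of the original data. Because word order is discarded, both sentences end up with the same bag.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a whitespace-tokenized text; word order is discarded."""
    return Counter(text.split())


# Two hypothetical sentences: same words, different order.
bag_a = bag_of_words("the dog bites the man")
bag_b = bag_of_words("the man bites the dog")

print(bag_a == bag_b)  # True: the model cannot tell the two sentences apart
print(bag_a["the"])    # 2
```

The example also shows what the model loses: the two sentences mean different things, yet their representations are identical.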

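How to Build a Bag-of-Words Model

Build a dictionary (vocabulary) from all documents
Represent each document as a vector with the same dimension as the dictionary; since most entries are zero, the vectors can be stored as a sparse matrix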
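
The two steps above can be sketched in plain Python as follows. The toy documents and the helper names build_vocab, to_sparse and to_dense are assumptions made for illustration only. The sparse form stores just the non-zero entries as (index, count) pairs, and the dense form is the full vector with one slot per dictionary entry.

```python
def build_vocab(docs):
    """Map each distinct word to an integer index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def to_sparse(doc, vocab):
    """Return the document as a sorted list of (index, count) pairs, omitting zeros."""
    counts = {}
    for word in doc:
        counts[vocab[word]] = counts.get(vocab[word], 0) + 1
    return sorted(counts.items())


def to_dense(sparse, size):
    """Expand (index, count) pairs into a full vector of length `size`."""
    vec = [0] * size
    for idx, count in sparse:
        vec[idx] = count
    return vec


# Hypothetical, already tokenized documents.
docs = [["a", "b", "a"], ["b", "c"]]
vocab = build_vocab(docs)
sparse = [to_sparse(doc, vocab) for doc in docs]
dense = [to_dense(s, len(vocab)) for s in sparse]

print(sparse)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
print(dense)   # [[2, 1, 0], [0, 1, 1]]
```

With a realistic vocabulary of tens of thousands of words, almost every entry of the dense vector is zero, which is why the sparse form is usually preferred.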

# encoding: utf-8

import logging

import pandas as pd
from gensim import corpora


class my_corpora:

    def __init__(self):
        self.docs = []  # one dict per document: word index -> number of occurrences
        self.bows = []  # one list of (word index, count) tuples per document
        self.word2idx = {}  # key: word, value: index of that word
        self.idx2word = {}  # key: index, value: word
        self.idx = 0  # next unused index

    def dictionary(self, docs):
        # give every previously unseen word the next free index
        for doc in docs:
            for word in doc:
                if word not in self.word2idx:
                    self.word2idx[word] = self.idx
                    self.idx2word[self.idx] = word
                    self.idx += 1

    def doc2bow(self, docs):

        # count how many times each word occurs in each document
        for title in docs:
            tmp = {}
            for word in title:
                tmp[self.word2idx[word]] = tmp.get(self.word2idx[word], 0) + 1
            self.docs.append(tmp)

        # build the bag-of-words representation
        for doc in self.docs:
            tmp = []
            for k in doc:
                tmp.append((k, doc[k]))
            self.bows.append(tmp)

        # check
        for bow in self.bows:
            print(bow)
        print()

if __name__ == '__main__':

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # gensim logging

    # skip the header row of the file and name the columns explicitly
    test_df = pd.read_csv('./test_words_clean.csv', sep=',', encoding='utf-8', skiprows=[0], header=None, names=['id', 'title'])
    test_df['title'] = test_df['title'].apply(lambda x: x.split())

    # gensim
    dct = corpora.Dictionary(test_df['title'])
    # keep a plain list: the per-document lists have different lengths,
    # and recent NumPy versions refuse to build an array from ragged input
    bows = [dct.doc2bow(title) for title in test_df['title']]
    for bow in bows:
        print(bow)
    print()

    # the same thing with the custom class
    corpora_test = my_corpora()
    corpora_test.dictionary(test_df['title'])
    corpora_test.doc2bow(test_df['title'])

Text file test_words_clean.csv

id,title
1,美国 副 总统 彭斯 朝鲜 问题 为 所有 可能 结果 做好 准备 任何 核武器 使用 进行 快速 应对
2,香港 财政 司长 陈茂波 需要 继续 留意 全球 货币 环境 地缘 政治 变化 政策 风险
3,日本央行 理事 雨宫 正佳 退出 宽松 细节 经济 物价 状况 决定
4,日本央行 新任 副行长 日本 存在 通缩 但是 距离 通胀 目标 距离
5,德国 地学 研究 中心 智利 北部 海岸 附近 发生 级 地震
6,美国 财长 努钦 美国 总统 特朗普 朝鲜 最高 领导人 金正恩 会面 条件 朝鲜 无核化 以及 不再 进行 导弹 测试
7,据 韩联社 韩国 总统 文在寅 美国 总统 特朗普 通电 朝鲜 可能 对话 进行 讨论
8,美联储 主席 鲍威尔 预计 通胀 于 中期 稳定

Output (gensim first, then the custom class)

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
[(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]
[(33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)]
[(35, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1)]
[(53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1)]
[(10, 1), (12, 2), (15, 2), (16, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1)]
[(6, 1), (10, 2), (12, 1), (15, 1), (16, 1), (73, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1)]
[(52, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1)]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)]
[(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]
[(33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)]
[(33, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1)]
[(53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1)]
[(0, 2), (64, 1), (65, 1), (2, 1), (66, 1), (4, 2), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (15, 1), (75, 1), (76, 1)]
[(77, 1), (78, 1), (79, 1), (2, 2), (80, 1), (0, 1), (66, 1), (81, 1), (4, 1), (8, 1), (82, 1), (15, 1), (83, 1)]
[(84, 1), (85, 1), (86, 1), (87, 1), (51, 1), (88, 1), (89, 1), (90, 1)]
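
The two listings above assign different ids to some words (compare their sixth lines), so they cannot be compared entry by entry. One way to check that both describe the same word counts is to map the ids back to words first. The sketch below does this on toy data; bow_to_counts and the two id maps are hypothetical and are not part of the script above. In the script itself, the custom class keeps this mapping in idx2word, and a gensim Dictionary can be indexed by id (dct[token_id]) to get the token.

```python
def bow_to_counts(bow, id2word):
    """Convert a list of (id, count) pairs into a {word: count} dict."""
    return {id2word[idx]: count for idx, count in bow}


# Toy example: the same document encoded under two different id assignments.
id2word_a = {0: "朝鲜", 1: "美国", 2: "总统"}
id2word_b = {0: "美国", 1: "总统", 2: "朝鲜"}
bow_a = [(0, 2), (1, 2), (2, 1)]
bow_b = [(0, 2), (1, 1), (2, 2)]

same = bow_to_counts(bow_a, id2word_a) == bow_to_counts(bow_b, id2word_b)
print(same)  # True: the ids differ, but the word counts agree
```

The ids themselves carry no meaning; only the pairing of each word with its count matters.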

