
TF-IDF

 2 years ago
source link: https://ylhao.github.io/2018/05/17/182/

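How term frequency (TF) is computed

Term frequency can be defined in any of three ways: as a raw count, as the count normalized by document length, or as the count normalized by the count of the document's most frequent term.

$$\mathrm{TF} = \text{number of times the term appears in the document}$$

$$\mathrm{TF} = \frac{\text{number of times the term appears in the document}}{\text{total number of terms in the document}}$$

$$\mathrm{TF} = \frac{\text{number of times the term appears in the document}}{\text{number of occurrences of the most frequent term in the document}}$$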
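The three TF variants can be compared side by side with a few lines of Python. This is only a minimal sketch: the function name `term_frequencies`, the dictionary keys, and the toy document are made up for illustration, and the document is assumed to be non-empty and already tokenized.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return the three TF variants for each distinct term of a tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)                # total number of terms in the document
    max_count = max(counts.values())   # count of the most frequent term
    return {
        term: {
            "raw": count,                   # variant 1: raw count
            "by_length": count / total,     # variant 2: normalized by document length
            "by_max": count / max_count,    # variant 3: normalized by the most frequent term
        }
        for term, count in counts.items()
    }


tokens = "a b a c".split()  # hypothetical toy document
tf = term_frequencies(tokens)
print(tf["a"])  # {'raw': 2, 'by_length': 0.5, 'by_max': 1.0}
```

The normalized variants keep long documents from dominating simply because they contain more words; the raw count is the simplest and is what the code in this post uses.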

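How inverse document frequency (IDF) is computed

$$\mathrm{IDF} = \log_2\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the term} + 1}\right)$$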
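A short sketch of this IDF definition, assuming each document is already a list of tokens. The function name `inverse_document_frequencies` and the toy corpus are hypothetical and not part of the original post.

```python
import math


def inverse_document_frequencies(docs):
    """Map each term to log2(N / (df + 1)), where df is the number of documents containing it."""
    doc_count = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(doc_count / (df + 1)) for term, df in doc_freq.items()}


docs = [["a", "b"], ["a", "c"], ["a", "d"], ["e"]]  # hypothetical toy corpus
idf = inverse_document_frequencies(docs)
print(idf["a"])  # 0.0, since "a" is in 3 of 4 documents: log2(4 / 4)
print(idf["e"])  # 1.0, since "e" is in 1 of 4 documents: log2(4 / 2)
```

Note that with the +1 in the denominator, a term that occurs in every document gets a slightly negative IDF, log2(N / (N + 1)).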

How TF-IDF is computed

$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$$

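The code below uses the following definitions:

$$\mathrm{TF} = \text{number of times the term appears in the document}$$

$$\mathrm{IDF} = \log_2\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the term} + 1}\right)$$

$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF}$$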
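As a quick check of these definitions against the sample corpus listed further down (8 titles), consider 距离, which occurs twice in title 4 and in no other title, and 通胀, which occurs once in title 4 and also appears in title 8. The counts here are read off the sample data by hand:

$$\text{TF-IDF}(\text{距离}) = 2 \times \log_2\frac{8}{1 + 1} = 2 \times 2 = 4$$

$$\text{TF-IDF}(\text{通胀}) = 1 \times \log_2\frac{8}{2 + 1} \approx 1.415$$

Both values agree with the entries 4.0 and 1.4150374992788437 that the custom class prints for title 4.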

The code below also extracts keywords based on TF-IDF.

# encoding: utf-8

import logging
import math

import pandas as pd
from gensim import corpora, models


class my_corpora:

    def __init__(self):
        self.docs = []  # one dict per document: word index -> number of occurrences
        self.bows = []  # bag-of-words vectors: lists of (word index, count)
        self.tfidfs = []  # TF-IDF vectors: lists of (word index, weight)
        self.word2idx = {}  # word -> index
        self.idx2word = {}  # index -> word
        self.idf = {}  # word index -> IDF value
        self.dct = {}  # index -> word (not used below)
        self.idx = 0  # next free index

    def dictionary(self, docs):
        # Assign an index to every distinct word, in order of first appearance
        for doc in docs:
            for word in doc:
                if word not in self.word2idx:
                    self.word2idx[word] = self.idx
                    self.idx2word[self.idx] = word
                    self.idx += 1

    def doc2bow(self, docs):

        # Count how often each word occurs in each document
        for title in docs:
            tmp = {}
            for word in title:
                tmp[self.word2idx[word]] = tmp.get(self.word2idx[word], 0) + 1
            self.docs.append(tmp)

        # Build the bag-of-words representation
        for doc in self.docs:
            tmp = []
            for k in doc:
                tmp.append((k, doc[k]))
            self.bows.append(tmp)

    def bow2tfidf(self):

        # Count the number of documents that contain each word
        for doc in self.docs:
            for k in doc:
                self.idf[k] = self.idf.get(k, 0) + 1

        # Total number of documents
        doc_num = len(self.docs)

        # Turn each document count into an inverse document frequency
        for k in self.idf:
            self.idf[k] = math.log(doc_num / (self.idf[k] + 1.0), 2)

        # TF-IDF = raw count * IDF
        for bow in self.bows:
            tmp = []
            for idx, tf in bow:
                tmp.append((idx, tf * self.idf[idx]))
            self.tfidfs.append(tmp)

        for tfidf in self.tfidfs:
            print(tfidf)
        print()

    def extract_keywords(self, topn=3):
        # Print the topn highest-weighted words of each document
        for tfidf in self.tfidfs:
            tfidf = sorted(tfidf, key=lambda item: -item[1])
            key_words = [self.idx2word[idx] for idx, _ in tfidf[:topn]]
            print(key_words)
        print()


if __name__ == '__main__':

    # logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)  # gensim logging

    test_df = pd.read_csv('./test_words_clean.csv', sep=',', encoding='utf-8', skiprows=[0], header=None, names=['id', 'title'])
    test_df['title'] = test_df['title'].apply(lambda x: x.split())

    # gensim
    dct = corpora.Dictionary(test_df['title'])
    # A plain list is used because the rows have different lengths;
    # recent NumPy versions refuse to build an array from ragged lists
    bows = [dct.doc2bow(title) for title in test_df['title']]

    tfidf_model = models.TfidfModel(bows)
    tfidf_vecs = tfidf_model[bows]
    for vec in tfidf_vecs:
        print(vec)
    print()

    for vec in tfidf_vecs:
        vec = sorted(vec, key=lambda item: -item[1])
        key_words = [dct[idx] for idx, _ in vec[:6]]
        print(key_words)
    print()


    # The same steps with the custom class
    corpora_test = my_corpora()
    corpora_test.dictionary(test_df['title'])
    corpora_test.doc2bow(test_df['title'])
    corpora_test.bow2tfidf()
    corpora_test.extract_keywords(6)

Input text file test_words_clean.csv

id,title
1,美国 副 总统 彭斯 朝鲜 问题 为 所有 可能 结果 做好 准备 任何 核武器 使用 进行 快速 应对
2,香港 财政 司长 陈茂波 需要 继续 留意 全球 货币 环境 地缘 政治 变化 政策 风险
3,日本央行 理事 雨宫 正佳 退出 宽松 细节 经济 物价 状况 决定
4,日本央行 新任 副行长 日本 存在 通缩 但是 距离 通胀 目标 距离
5,德国 地学 研究 中心 智利 北部 海岸 附近 发生 级 地震
6,美国 财长 努钦 美国 总统 特朗普 朝鲜 最高 领导人 金正恩 会面 条件 朝鲜 无核化 以及 不再 进行 导弹 测试
7,据 韩联社 韩国 总统 文在寅 美国 总统 特朗普 通电 朝鲜 可能 对话 进行 讨论
8,美联储 主席 鲍威尔 预计 通胀 于 中期 稳定

Output: the gensim TF-IDF vectors and the top 6 keywords per title, followed by the same two listings from the custom class

[(0, 0.26412572618001456), (1, 0.26412572618001456), (2, 0.26412572618001456), (3, 0.26412572618001456), (4, 0.26412572618001456), (5, 0.26412572618001456), (6, 0.17608381745334306), (7, 0.26412572618001456), (8, 0.26412572618001456), (9, 0.26412572618001456), (10, 0.12458260235632548), (11, 0.26412572618001456), (12, 0.12458260235632548), (13, 0.26412572618001456), (14, 0.26412572618001456), (15, 0.12458260235632548), (16, 0.12458260235632548), (17, 0.26412572618001456)]
[(18, 0.25819888974716115), (19, 0.25819888974716115), (20, 0.25819888974716115), (21, 0.25819888974716115), (22, 0.25819888974716115), (23, 0.25819888974716115), (24, 0.25819888974716115), (25, 0.25819888974716115), (26, 0.25819888974716115), (27, 0.25819888974716115), (28, 0.25819888974716115), (29, 0.25819888974716115), (30, 0.25819888974716115), (31, 0.25819888974716115), (32, 0.25819888974716115)]
[(33, 0.309426373877638), (34, 0.309426373877638), (35, 0.20628424925175867), (36, 0.309426373877638), (37, 0.309426373877638), (38, 0.309426373877638), (39, 0.309426373877638), (40, 0.309426373877638), (41, 0.309426373877638), (42, 0.309426373877638), (43, 0.309426373877638)]
[(35, 0.1933472978091327), (44, 0.29002094671369905), (45, 0.29002094671369905), (46, 0.29002094671369905), (47, 0.29002094671369905), (48, 0.29002094671369905), (49, 0.29002094671369905), (50, 0.5800418934273981), (51, 0.29002094671369905), (52, 0.1933472978091327)]
[(53, 0.30151134457776363), (54, 0.30151134457776363), (55, 0.30151134457776363), (56, 0.30151134457776363), (57, 0.30151134457776363), (58, 0.30151134457776363), (59, 0.30151134457776363), (60, 0.30151134457776363), (61, 0.30151134457776363), (62, 0.30151134457776363), (63, 0.30151134457776363)]
[(10, 0.12315233159075571), (12, 0.24630466318151142), (15, 0.24630466318151142), (16, 0.12315233159075571), (64, 0.26109343035824584), (65, 0.26109343035824584), (66, 0.26109343035824584), (67, 0.26109343035824584), (68, 0.26109343035824584), (69, 0.26109343035824584), (70, 0.26109343035824584), (71, 0.26109343035824584), (72, 0.26109343035824584), (73, 0.17406228690549722), (74, 0.26109343035824584), (75, 0.26109343035824584), (76, 0.26109343035824584)]
[(6, 0.21690963820959008), (10, 0.30693527202157705), (12, 0.15346763601078853), (15, 0.15346763601078853), (16, 0.15346763601078853), (73, 0.21690963820959008), (77, 0.3253644573143851), (78, 0.3253644573143851), (79, 0.3253644573143851), (80, 0.3253644573143851), (81, 0.3253644573143851), (82, 0.3253644573143851), (83, 0.3253644573143851)]
[(52, 0.24433888871261045), (84, 0.36650833306891567), (85, 0.36650833306891567), (86, 0.36650833306891567), (87, 0.36650833306891567), (88, 0.36650833306891567), (89, 0.36650833306891567), (90, 0.36650833306891567)]

['为', '任何', '使用', '做好', '准备', '副']
['全球', '变化', '司长', '地缘', '政治', '政策']
['决定', '宽松', '正佳', '物价', '状况', '理事']
['距离', '但是', '副行长', '存在', '新任', '日本']
['中心', '北部', '发生', '地学', '地震', '德国']
['不再', '以及', '会面', '努钦', '导弹', '无核化']
['对话', '据', '文在寅', '讨论', '通电', '韩国']
['中期', '主席', '于', '稳定', '美联储', '预计']

[(0, 1.0), (1, 2.0), (2, 1.0), (3, 2.0), (4, 1.0), (5, 2.0), (6, 2.0), (7, 2.0), (8, 1.4150374992788437), (9, 2.0), (10, 2.0), (11, 2.0), (12, 2.0), (13, 2.0), (14, 2.0), (15, 1.0), (16, 2.0), (17, 2.0)]
[(18, 2.0), (19, 2.0), (20, 2.0), (21, 2.0), (22, 2.0), (23, 2.0), (24, 2.0), (25, 2.0), (26, 2.0), (27, 2.0), (28, 2.0), (29, 2.0), (30, 2.0), (31, 2.0), (32, 2.0)]
[(33, 1.4150374992788437), (34, 2.0), (35, 2.0), (36, 2.0), (37, 2.0), (38, 2.0), (39, 2.0), (40, 2.0), (41, 2.0), (42, 2.0), (43, 2.0)]
[(33, 1.4150374992788437), (44, 2.0), (45, 2.0), (46, 2.0), (47, 2.0), (48, 2.0), (49, 2.0), (50, 4.0), (51, 1.4150374992788437), (52, 2.0)]
[(53, 2.0), (54, 2.0), (55, 2.0), (56, 2.0), (57, 2.0), (58, 2.0), (59, 2.0), (60, 2.0), (61, 2.0), (62, 2.0), (63, 2.0)]
[(0, 2.0), (64, 2.0), (65, 2.0), (2, 1.0), (66, 1.4150374992788437), (4, 2.0), (67, 2.0), (68, 2.0), (69, 2.0), (70, 2.0), (71, 2.0), (72, 2.0), (73, 2.0), (74, 2.0), (15, 1.0), (75, 2.0), (76, 2.0)]
[(77, 2.0), (78, 2.0), (79, 2.0), (2, 2.0), (80, 2.0), (0, 1.0), (66, 1.4150374992788437), (81, 2.0), (4, 1.0), (8, 1.4150374992788437), (82, 2.0), (15, 1.0), (83, 2.0)]
[(84, 2.0), (85, 2.0), (86, 2.0), (87, 2.0), (51, 1.4150374992788437), (88, 2.0), (89, 2.0), (90, 2.0)]

['副', '彭斯', '问题', '为', '所有', '结果']
['香港', '财政', '司长', '陈茂波', '需要', '继续']
['理事', '雨宫', '正佳', '退出', '宽松', '细节']
['距离', '新任', '副行长', '日本', '存在', '通缩']
['德国', '地学', '研究', '中心', '智利', '北部']
['美国', '财长', '努钦', '朝鲜', '最高', '领导人']
['据', '韩联社', '韩国', '总统', '文在寅', '通电']
['美联储', '主席', '鲍威尔', '预计', '于', '中期']
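
The gensim weights differ from the custom class's weights because `TfidfModel`, with its default settings, computes IDF as log2(N / df) without the +1 and then scales each vector to unit L2 length. The sketch below is not gensim code; it is a plain-Python check, under that reading of the defaults, that reproduces the gensim weights printed for title 8. The document frequencies are read off the sample data by hand.

```python
import math

N = 8  # number of titles in test_words_clean.csv

# Document frequency of each token in title 8, read off the sample data
doc_freq = {"美联储": 1, "主席": 1, "鲍威尔": 1, "预计": 1,
            "通胀": 2, "于": 1, "中期": 1, "稳定": 1}

# Every token occurs once in the title, so TF = 1 and the raw weight equals the IDF
raw = {term: math.log2(N / df) for term, df in doc_freq.items()}

# Scale the vector to unit L2 length
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {term: w / norm for term, w in raw.items()}

print(round(weights["通胀"], 4))  # 0.2443
print(round(weights["主席"], 4))  # 0.3665
```

These match the gensim entries 0.24433888871261045 and 0.36650833306891567 for title 8. Since the titles are short and most words occur in only one title, many weights tie; `sorted` is stable, so tied words keep the index order of whichever implementation produced them, which is why the two keyword lists differ.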

  1. 阮一峰, "TF-IDF与余弦相似性的应用(一):自动提取关键词" (Ruan Yifeng, "Applications of TF-IDF and Cosine Similarity (1): Automatic Keyword Extraction")

