
[NLP] Study Notes (work in progress)

May 1, 2021

Author: Guofei

Category: 2-4-NLP, Article No. 341


Copyright notice: the author of this article is Guo Fei. Feel free to repost, but please include a link to the original and notify the author.
原文链接:https://www.guofei.site/2021/05/01/nlp.html


Corpora and Datasets

Types of corpora: isolated corpora (plain collections of text), categorized corpora (text plus category labels), and overlapping corpora (categorized, but the categories overlap).

Basic data types: discrete, continuous, and so on; nothing special to add.

Corpus file formats: txt, csv, xml, json, sparse matrices, etc.

Where to get corpora:

  • nltk: dir(nltk.corpus) (see the sketch after this list)
  • https://github.com/awesomedata/awesome-public-datasets
  • https://www.kaggle.com/datasets
  • https://www.reddit.com/r/datasets/
  • write your own crawler
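
As a quick look at the first option, a minimal sketch that lists the corpus readers bundled with nltk (the underlying data still has to be downloaded, as described in the next section):

import nltk.corpus

# Corpus readers that ship with nltk; the data itself is downloaded separately.
print([name for name in dir(nltk.corpus) if not name.startswith('_')])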

nltk

Downloading a corpus

import nltk

nltk.download('brown')
# The download may fail due to network problems; you can instead download manually from:
# https://github.com/nltk/nltk_data/tree/gh-pages
# and copy everything under the packages folder into one of NLTK's data directories.
# Some other features additionally require:
# unzipping /nltk_data/tokenizers/punkt.zip
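
To see which data directories NLTK actually searches (nltk.data.path holds the list), a quick check:

import nltk

# Directories NLTK searches for corpora and models; copy the manually
# downloaded packages into any one of these.
print(nltk.data.path)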

Using a corpus

from nltk.corpus import brown

brown.categories()  # categories in the corpus
brown.fileids()  # files in the corpus
brown.categories(fileids=['ca01'])  # categories of the given file(s)
brown.fileids(categories=['news'])  # files in the given category / categories



brown.raw()  # the raw corpus text, as a str
brown.words()  # the words of the corpus
brown.sents()  # the sentences of the corpus
# raw, words and sents all accept fileids=[f1, f2, f3] or categories=[c1, c2],
# and the result can be indexed, e.g.:
brown.words(categories='news')[10:30]



brown.abspath(fileid='ca01')  # absolute path of the file

brown.readme()  # the corpus readme

Text Preprocessing

  • Sentence splitting. Not just a matter of looking for periods, since dots also appear in abbreviations; use an existing tool, or build an ML model for it.
  • Lemmatization, e.g. of comparatives, past tense, present tense
  • Stopword removal
  • Spelling correction.

Sentence splitting

import nltk

text = "How long does it take to get a Ph.D. degree? I don't know."
nltk.tokenize.sent_tokenize(text)
# another, trainable, sentence splitter
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
tokenizer.tokenize(text)

Tokenization

import nltk

text = 'Hello! Everyone!! Good morning!'
print(nltk.tokenize.word_tokenize(text=text))  # the most common approach
print(text.split())  # the naive approach
print(nltk.tokenize.regexp_tokenize(text=text, pattern=r'\w+'))  # user-defined pattern
print(nltk.tokenize.wordpunct_tokenize(text=text))
print(nltk.tokenize.blankline_tokenize(text=text))

Stemming and lemmatization

If the part of speech is not known in advance, use stemming; if it is known, use lemmatization.

stemming:

import nltk

pst = nltk.stem.PorterStemmer()
lst = nltk.stem.LancasterStemmer()
sbt = nltk.stem.SnowballStemmer(language='english')  # supports many languages

pst.stem('eating')
lst.stem('shopping')
sbt.stem('asked')

lemmatization:

from nltk.stem.wordnet import WordNetLemmatizer

wordlemma = WordNetLemmatizer()
print(wordlemma.lemmatize('cars'))
print(wordlemma.lemmatize('walking', pos='v'))
print(wordlemma.lemmatize('meeting', pos='n'))
print(wordlemma.lemmatize('meeting', pos='v'))
print(wordlemma.lemmatize('better', pos='a'))
print(wordlemma.lemmatize('is', pos='v'))
print(wordlemma.lemmatize('funnier', pos='a'))
print(wordlemma.lemmatize('expected', pos='v'))
print(wordlemma.lemmatize('fantasized', pos='v'))

Stopword removal

Removing stopwords barely affects the meaning, and because stopwords are so frequent they tend to get in the model's way.

from nltk.corpus import stopwords

stop_words = stopwords.words('english')  # stopword lists for dozens of languages
stop_words = set(["hi", "bye"])  # or define your own stopword set

# then filter them out
tokens = [word for word in text.split() if word not in stop_words]

Rare word removal

freq_dist = nltk.FreqDist(tokens)  # word-frequency counts, similar to collections.Counter
freq_dist1 = sorted(freq_dist.items(), key=lambda x: x[1], reverse=True)
rare_words = [i[0] for i in freq_dist1[-10:]]  # the 10 least frequent words
# then filter them out
tokens = [word for word in text.split() if word not in rare_words]

Spelling correction

The standard criterion is minimum edit distance (the number of insertions, deletions and substitutions needed), which can be computed with dynamic programming. The corrector below (Peter Norvig's classic implementation) takes a generate-and-rank approach instead: it enumerates every candidate within one or two edits of the input and returns the most probable known word.

import re
from collections import Counter

# WORDS would normally be built from a corpus; a tiny hard-coded list is used here just to demo the code
WORDS = Counter(['apple', 'correction', 'statement', 'tutors'])


def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N


def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)


def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])


def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)


def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))


if __name__ == "__main__":
    print(correction('aple'))
    print(correction('correcton'))
    print(correction('statament'))
    print(correction('tutpore'))

nltk also ships with an edit-distance function:

from nltk.metrics import edit_distance
edit_distance('aple', 'apple')
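
For reference, a minimal sketch of the dynamic-programming computation behind this (Levenshtein distance, with insertions, deletions and substitutions all costing 1, matching edit_distance's default behaviour):

def levenshtein(s1, s2):
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]


print(levenshtein('aple', 'apple'))  # 1, same as edit_distance above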

Feature Engineering

Basic NLP features

  • Named entity recognition (NER)
  • n-gram
  • Bag-of-words (see the sketch after this list)
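
A minimal sketch of the last two items, using nltk.util.ngrams for the n-grams and a plain Counter as the bag-of-words representation:

from collections import Counter

import nltk
from nltk.util import ngrams

tokens = nltk.word_tokenize('the quick brown fox jumps over the lazy dog')

# bigrams: all adjacent pairs of tokens
print(list(ngrams(tokens, 2)))

# bag-of-words: token counts, with word order thrown away
print(Counter(tokens))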

Statistical NLP features

  • One-hot encoding, etc.
  • TF-IDF (see the sketch after this list)
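
A minimal TF-IDF sketch. It uses scikit-learn's TfidfVectorizer, which is not mentioned in these notes and is only one of several ways to compute TF-IDF (get_feature_names_out assumes scikit-learn >= 1.0):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'cats and dogs are pets']

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))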

Advanced NLP features

  • word2vec (see the sketch below)
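
A minimal word2vec sketch. It assumes gensim (4.x API), which these notes do not otherwise use, and a toy corpus far too small to learn meaningful vectors:

from gensim.models import Word2Vec  # gensim >= 4.x API assumed

sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'sat', 'on', 'the', 'log']]

# Train a tiny model; real corpora need millions of tokens for useful vectors.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=10)

print(model.wv['cat'])               # the learned vector for 'cat'
print(model.wv.most_similar('cat'))  # nearest neighbours in the vector space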

POS Tagging

The built-in POS tagger (a maximum-entropy classifier in older NLTK releases; recent versions use an averaged-perceptron model):

import nltk

text = 'While in France, Christine Lagarde discussed short-term ' \
       'stimulus efforts in a recent interview at 5:00 P.M with the Wall Street Journal.'
nltk.pos_tag(nltk.word_tokenize(text=text))

Stanford tagger:

# Download the model from https://nlp.stanford.edu/software/tagger.shtml ; here the archive was unpacked into the Downloads folder

from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

jar = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/english-left3words-distsim.tagger'

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')


text = 'While in France, Christine Lagarde discussed short-term ' \
       'stimulus efforts in a recent interview at 5:00 P.M with the Wall Street Journal.'

tokenized_text=word_tokenize(text)
classified_text = pos_tagger.tag(tokenized_text)

print(classified_text)

Chinese is also supported:

from nltk.tag import StanfordPOSTagger
import jieba
jar = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/stanford-postagger.jar'
# model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/chinese-nodistsim.tagger'  # alternative model
model = '/Users/guofei/Downloads/stanford-postagger-full-2020-11-17/models/chinese-distsim.tagger'
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
text='张伟在北京上班,李雷在南京逛街'
tokenized_text = list(jieba.cut(text))
classified_text = pos_tagger.tag(tokenized_text)

N-gram taggers, of which there are several:

from nltk.corpus import brown
import nltk

brown_tagged_sents = brown.tagged_sents()
train_data = brown_tagged_sents[:int(0.9 * len(brown_tagged_sents))]
test_data = brown_tagged_sents[int(0.9 * len(brown_tagged_sents)):]

default_tagger = nltk.DefaultTagger('NN')  # NN is the most common tag; always predicting NN already gives about 13% accuracy
default_tagger.evaluate(brown_tagged_sents)

unigram_tagger = nltk.tag.UnigramTagger(train=train_data, backoff=default_tagger)
bigram_tagger = nltk.tag.BigramTagger(train=train_data, backoff=unigram_tagger)
trigram_tagger = nltk.tag.TrigramTagger(train=train_data, backoff=bigram_tagger)


print(unigram_tagger.evaluate(test_data))  # 89%
print(bigram_tagger.evaluate(test_data))  # 91%
print(trigram_tagger.evaluate(test_data))  # 91%

NER tagger

import nltk

text = 'London is a big city in the United Kingdom.'

nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text=text)), binary=False)
# returns an nltk.Tree whose named-entity chunks are labelled (GPE, PERSON, ORGANIZATION, ...)

Text Structure Parsing

Rule-based NLP

When should it be used? (A small example follows the list below.)

  • Expert knowledge is easy to turn into rules
  • The dataset is small
  • High precision is required
  • Very broad coverage is not needed
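
As a concrete example, a minimal rule-based chunker built with nltk.RegexpParser; the noun-phrase rule below is hand-written, which is exactly the kind of expert rule the list above refers to:

import nltk

# A tiny chunk grammar: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun.
grammar = 'NP: {<DT>?<JJ>*<NN.*>}'
chunker = nltk.RegexpParser(grammar)

text = 'The quick brown fox jumped over the lazy dog.'
tagged = nltk.pos_tag(nltk.word_tokenize(text))

tree = chunker.parse(tagged)  # an nltk.Tree with NP subtrees
print(tree)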

References

Python Natural Language Processing (Jalaj Thanaki)


