0

【NLP】基本数据结构

 2 years ago
source link: https://www.guofei.site/2021/10/17/nlp_data_structure.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

【NLP】基本数据结构

2021年10月17日

Author: Guofei

文章归类: 2-4-NLP ,文章编号: 341


版权声明:本文作者是郭飞。转载随意,但需要标明原文链接,并通知本人
原文链接:https://www.guofei.site/2021/10/17/nlp_data_structure.html

Edit

任务:我们事先有个词典,任务是给定一句话,找出这句话中所有在词典中的词语。

例如“商品和服务”,应当提取出 “商品”,“和”,“和服”,“服务”

用最粗暴的遍历当然可以解决,为了提升性能,显然有些做法:

trie

trie

这个版本是我见过实现最好的

class Node(object):
    def __init__(self, value) -> None:
        self._children = {}
        self._value = value

    def _add_child(self, char, value, overwrite=False):
        child = self._children.get(char)
        if child is None:
            child = Node(value)
            self._children[char] = child
        elif overwrite:
            child._value = value
        return child


class Trie(Node):
    def __init__(self) -> None:
        super().__init__(None)

    def __contains__(self, key):
        return self[key] is not None

    def __getitem__(self, key):
        state = self
        for char in key:
            state = state._children.get(char)
            if state is None:
                return None
        return state._value

    def __setitem__(self, key, value):
        state = self
        for i, char in enumerate(key):
            if i < len(key) - 1:
                state = state._add_child(char, None, False)
            else:
                state = state._add_child(char, value, True)
trie = Trie()
# 增
trie['自然'] = 'nature'
trie['自然人'] = 'human'
trie['自然语言'] = 'language'
trie['自语'] = 'talk	to oneself'
trie['入门'] = 'introduction'
assert '自然' in trie
# 删
trie['自然'] = None
assert '自然' not in trie
# 改
trie['自然语言'] = 'human language'
assert trie['自然语言'] == 'human language'
# 查
assert trie['入门'] == 'introduction'

改进1

上面的实现中,我们用 dict 来查找下一个字符,这里面有个问题

  • dict 对应的 hash 是针对字符串而设计的,它是64位hash
  • 而我们的数据只有 char
  • 因此浪费了大量性能

改进2

如果 “自然” 这条路径不存在,那么 “自然语言处理” 这个路径一定不存在,就没必要往下查询了。


您的支持将鼓励我继续创作!

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK