[Natural Language Processing (NLP)] Implementing a Chatbot Module

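About the author: an undergraduate student, Huawei Cloud Expert (华为云享专家), Alibaba Cloud expert blogger, member of Tencent Cloud's TDP (腾云先锋), lead of the 云曦智划 project, volunteer for the National Expert Committee for Computer Teaching and Industry Practice Resource Construction in Higher Education (TIPCC), and programming enthusiast. I look forward to learning and improving together with everyone.

Blog homepage: ぃ灵彧が的学习日志

Column: Artificial Intelligence

Column motto: If you decide to shine, no mountain can block you and no sea can stop you.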

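(I) Task Description

This example introduces the features and usage of PaddleNLP's built-in generation API, and then uses PaddleNLP's built-in plato-mini model together with a configured generation API to build a simple chit-chat bot.

(II) Environment Setup

This example is based on version 2.0 of the PaddlePaddle open-source framework.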

import paddle
import paddle.nn.functional as F
import re
import numpy as np

print(paddle.__version__)

# Select the CPU or GPU environment by passing the target device to paddle.set_device().
# device = paddle.set_device('gpu')

The output is shown in Figure 1 below:
[Figure 1: the printed PaddlePaddle version]

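I. Downloading and Updating the Required Packages

The AI Studio platform already has PaddleNLP installed by default, but you still need to upgrade it with the commands below; otherwise the code later in this article will fail at runtime.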

!pip install --upgrade paddlenlp -i https://pypi.org/simple
!pip install --upgrade pip
!pip install --upgrade sentencepiece

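II. Building a Chit-Chat Bot with the Generation API

Next, we look at how to use the UnifiedTransformer model and its built-in generation API to build a chit-chat bot.

(I) Data Processing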

from paddlenlp.transformers import UnifiedTransformerTokenizer

# Name of the pretrained model to use
model_name = 'plato-mini'
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name)

# User utterance: "Hello, how old are you this year?"
user_input = ['你好啊，你今年多大了']

# Call dialogue_encode to build the model inputs
encoded_input = tokenizer.dialogue_encode(
    user_input,
    add_start_token_as_response=True,
    return_tensors=True,
    is_split_into_words=False
)

print(encoded_input.keys())
# dict_keys(['input_ids','token_type_ids','position_ids','attention_mask'])
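To give a rough feel for what the first three keys hold, here is a simplified plain-Python sketch. It is not PaddleNLP's implementation: the layout (context utterances separated by a separator token and marked with segment 0, followed by a start token for the response marked with segment 1) is my assumption about the UnifiedTransformer input format, and the function name `sketch_dialogue_encode` and all token IDs are made up for illustration. The attention mask is left out.

```python
# Simplified illustration only; NOT PaddleNLP's actual dialogue_encode.
# CLS_ID / SEP_ID and every token ID below are hypothetical values.
CLS_ID, SEP_ID = 1, 2


def sketch_dialogue_encode(history_token_ids, add_start_token_as_response=True):
    """Lay out a dialogue history as input_ids, token_type_ids and position_ids."""
    # Context: [CLS] utterance_1 [SEP] utterance_2 [SEP] ...
    input_ids = [CLS_ID]
    for utterance in history_token_ids:
        input_ids += utterance + [SEP_ID]
    token_type_ids = [0] * len(input_ids)  # segment 0 marks the context

    if add_start_token_as_response:
        # A start token tells the model where the response begins
        input_ids.append(CLS_ID)
        token_type_ids.append(1)  # segment 1 marks the response

    position_ids = list(range(len(input_ids)))
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "position_ids": position_ids,
    }


# Two already-tokenized utterances with made-up token IDs
encoded_sketch = sketch_dialogue_encode([[11, 12, 13], [21, 22]])
print(encoded_sketch)
```

The real tokenizer also builds an attention mask and returns tensors when `return_tensors=True`; check the PaddleNLP documentation for the exact format.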

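(II) Loading a Pretrained Model with PaddleNLP in One Step

PaddleNLP currently provides three Chinese pretrained models for UnifiedTransformer:

unified_transformer-12L-cn: pretrained on a large-scale Chinese dialogue dataset.
unified_transformer-12L-cn-luge: obtained by fine-tuning unified_transformer-12L-cn on the 千言 (LUGE) dialogue dataset.
plato-mini: pretrained on billion-scale Chinese chit-chat dialogue data.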

from paddlenlp.transformers import UnifiedTransformerLMHeadModel

model = UnifiedTransformerLMHeadModel.from_pretrained(model_name)

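(III) Producing Model Predictions with the Generation API

Next, we pass the processed input to generate() and configure the decoding strategy. Here we use top-k sampling: at each step, the next token is sampled from the k most probable candidates in proportion to their probabilities.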

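# generate() returns the generated token IDs and a score for each sequence
ids, scores = model.generate(
    input_ids=encoded_input['input_ids'],
    token_type_ids=encoded_input['token_type_ids'],
    position_ids=encoded_input['position_ids'],
    attention_mask=encoded_input['attention_mask'],
    max_length=64,
    min_length=1,
    decode_strategy='sampling',  # sample instead of greedy or beam search
    top_k=5,                     # sample only from the 5 most probable tokens
    num_return_sequences=20      # return 20 candidate replies
)
print(ids)
print(scores)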
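To make the top-k sampling strategy concrete, the following is a minimal standard-library sketch of a single decoding step. It is an illustration of the general technique, not the code PaddleNLP runs internally; the function name `top_k_sample` and the toy logits are made up.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token ID from the k highest-scoring entries of `logits`."""
    # Indices of the k largest logits, best first
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the top-k only (shifted by the max for stability);
    # random.choices normalizes the weights itself
    max_logit = logits[top_ids[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, 1.5, -1.0, 3.0, 0.0]  # made-up scores for a 6-token vocabulary
samples = [top_k_sample(toy_logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only token IDs among the top 3 can appear
```

With k=3, only tokens 4, 0 and 2 can ever be drawn, however many samples are taken. A smaller k makes replies more conservative, while a larger k makes them more varied.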

Part of the output of the generate() call is shown in the figure below:

(IV) Converting Vocabulary IDs to Text

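# Convert vocabulary IDs back to the corresponding Chinese text
response = []
for sequence_ids in ids.numpy().tolist():
    # Cut each sequence at the first [SEP] token, if one was generated
    if tokenizer.sep_token_id in sequence_ids:
        sequence_ids = sequence_ids[:sequence_ids.index(tokenizer.sep_token_id)]
    text = tokenizer.convert_ids_to_string(sequence_ids, keep_space=False)
    response.append(text)
print(response)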

So when we ask the bot "你好啊，你今年多大了" ("Hello, how old are you this year?"), it returns replies like the following:

[Figure: sample replies generated by the bot]
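Because generate() returns 20 candidates, a chatbot usually has to settle on one reply. Below is a small sketch of one possible way to do that: take the highest-scoring non-empty candidate. The helpers `truncate_at_sep` and `select_reply`, the sample strings, and the assumption that a higher score means a more likely reply are all mine, not part of the original example.

```python
def truncate_at_sep(token_ids, sep_id):
    """Return the tokens before the first separator, or all of them if none is present."""
    if sep_id in token_ids:
        return token_ids[:token_ids.index(sep_id)]
    return token_ids


def select_reply(candidates, scores):
    """Return the non-empty candidate with the highest score (assumes higher is better)."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    for _, text in ranked:
        if text.strip():
            return text
    return ""  # every candidate was empty


# Made-up candidates and scores, standing in for `response` and `scores`
toy_candidates = ["I am 18.", "", "Guess!"]
toy_scores = [-3.2, -0.5, -4.1]
best = select_reply(toy_candidates, toy_scores)
print(best)
```

To apply this to the real output, the scores tensor would first need to be converted into a flat list of floats with one entry per candidate; its exact shape should be checked against the PaddleNLP version in use.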

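This series consists of notes and reflections based on 《自然语言处理实践》 (Natural Language Processing in Practice), published by Tsinghua University Press. All code is developed on Baidu PaddlePaddle. If anything here is infringing or inappropriate, please message me privately and I will cooperate fully; I reply to every message I see!

Finally, to close the article, here is a quote from this event ~(￣▽￣～)~:

[The greatest reason to learn is to escape mediocrity: start a day earlier and life gains one more share of brilliance; start a day later and you endure one more day of mediocrity.]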
