【深度学习前沿应用】文本审核

作者简介：在校大学生一枚，C/C++领域新星创作者，华为云享专家，阿里云专家博主，腾云先锋（TDP）成员，云曦智划项目总负责人，全国高等学校计算机教学与产业实践资源建设专家委员会（TIPCC）志愿者，以及编程爱好者，期待和大家一起学习，一起进步~
.
博客主页： ぃ灵彧が的学习日志
.
本文专栏： 机器学习
.
专栏寄语：若你决定灿烂，山无遮，海无拦
.

(文章目录)

1. 为什么要内容审核?

网络世界，内容参差不齐。优质内容是“流量天使”，背后的商业价值不言而喻，而劣质甚至违规内容一旦触碰法律红线，对社会和平台本身都是威胁。要想守护内容平台的一片清净，内容审核不可或缺。

2. 文本检测模型的作用？

文本检测模型可自动判别文本是否涉及敏感词，并给出相应的置信度，对文本中的各种描述和文案进行识别。

3. 如何实现文本检测模型？

PaddleHub提供了使用多种网络结构进行的色情文本检测预训练模块，比如，porn_detection_lstm采用LSTM网络结构并按字粒度切词进行文本检测，该模型最大句子长度为256字，但是仅仅支持预测，无法微调，相似功能的模型还有porn_detection_gru及porn_detection_cnn，分别使用GRU、CNN模型框架进行文本检测。

本实验的目的是简单地演示如何使用PaddleHub工具，实现文本审核（仅推理过程，暂不支持微调），实验平台为百度AI Studio，实验环境为Python3.7，Paddle2.0，PaddleHub2.0。

一、定义文本审核接口函数

(一)、数据加载及预处理

导入相关包

# from __future__ import print_function
import json
import six
# import paddlehub as hub
import paddlehub as hub

test_text = ["打击色情犯罪，是每一个人的责任", '引导未成年人远离黄赌毒'] #文本内容
use_gpu=True #是否调用GPU
batch_size=2 #批处理大小

(二)、通过LSTM实现

def porn_detection_lstm(test_text, use_gpu ,batch_size):
    # Load porn_detection_lstm module
    porn_detection_lstm = hub.Module(name="porn_detection_lstm")
    input_dict = {"text": test_text}
    results = porn_detection_lstm.detection(data=input_dict, use_gpu=use_gpu, batch_size=batch_size)
    return results

(三)、通过GRU实现

def porn_detection_gru(test_text, use_gpu, batch_size):
    # Load porn_detection_gru module
    porn_detection_gru = hub.Module(name="porn_detection_gru")
    input_dict = {"text": test_text}
    results = porn_detection_gru.detection(data=input_dict, use_gpu=use_gpu, batch_size=batch_size)
    return results

(四)、通过CNN实现

def porn_detection_cnn(test_text, use_gpu, batch_size):
    # Load porn_detection_cnn module
    porn_detection_cnn = hub.Module(name="porn_detection_cnn")
    results = porn_detection_cnn.detection(texts=test_text, use_gpu=use_gpu, batch_size=batch_size)
    return results

二、使用不同的模型进行审核

lstm = porn_detection_lstm(test_text, use_gpu, batch_size) #调用lstm
gru = porn_detection_gru(test_text, use_gpu, batch_size)   #调用gru
cnn = porn_detection_cnn(test_text, use_gpu, batch_size)   #调用cnn

三、输出审核结果

(一)、定义格式化输出函数

Tips:label的值越高则涉及色情的可能性越高

def output_dict(dic,pre='LSTM'):
    print('------\nPorn detection with {}'.format(pre))
    for line in dic:
        for k,v in line.items():
            print('{:20s}: {}'.format(k,v))

(二)、分别格式化输出三种审核结果

output_dict(lstm)
output_dict(gru,'GRU')
output_dict(cnn,'CNN')

输出结果如下图1所示：

(三)、求三种审核结果的平均结果作为输出

for index, text in enumerate(test_text):
    lstm[index]["text"] = text
    print("文本内容:",text)
    label = (lstm[index]["porn_detection_label"] + gru[index]["porn_detection_label"] + cnn[index]["porn_detection_label"])
    porn_probs = (lstm[index]["porn_probs"] + gru[index]["porn_probs"] + cnn[index]["porn_probs"])/3
    not_porn_probs = (lstm[index]["not_porn_probs"] + gru[index]["not_porn_probs"] + cnn[index]["not_porn_probs"])/3
    print('label:%0.0f, porn_probs:%0.5f, not_porn_probs:%0.5f' % (label, porn_probs, not_porn_probs))

输出结果如下图2所示：

本系列文章内容为根据清华社出版的《机器学习实践》所作的相关笔记和感悟，其中代码均为基于百度飞桨开发，若有任何侵权和不妥之处，请私信于我，定积极配合处理，看到必回！！！

最后，引用本次活动的一句话，来作为文章的结语～(￣▽￣～)~：

【学习的最大理由是想摆脱平庸，早一天就多一份人生的精彩；迟一天就多一天平庸的困扰。】

【深度学习前沿应用】文本审核

【深度学习前沿应用】文本审核

1. 为什么要内容审核?

2. 文本检测模型的作用？

3. 如何实现文本检测模型？

一、定义文本审核接口函数

(一)、数据加载及预处理

(二)、通过LSTM实现

(三)、通过GRU实现

(四)、通过CNN实现

二、使用不同的模型进行审核

三、输出审核结果

(一)、定义格式化输出函数

(二)、分别格式化输出三种审核结果

(三)、求三种审核结果的平均结果作为输出

Recommend

Wealthiest People in China (September 6, 2022)

Collaborative machine learning that preserves privacy

Zeerozone and The Sandbox Are Two Gaming Coins That Can Be Great!

Bringing Chefs to Your At-Home Dinner Table

JavaScript: Build Objects and Eliminate Looping with Reduce()

饮品界“月饼大赏”：星巴克、奈雪、书亦，你pick谁的创意？

特斯拉 FSD 正式涨价至 1.5 万美元，并未增加新功能

看完龙族大战指环王，我发现流媒体的尽头还得靠“拼爹”

五芳斋四闯IPO后终上市，来自第一股的“致命”吸引丨浙氪热点

同为山岳景区，为何黄山旅游亏最惨？

About Joyk