Voice bot powered by SAP Conversational AI

Today it is quite easy to build a voice recognition bot with SAP and open-source technology, and the two complement each other well: if speech recognition mishears a word or produces a typo, CAI (SAP Conversational AI) can still recognise the correct intent. This first part focuses on the Docker and CAI setup. In the second part we will go through publishing the bot to Kyma.

Architecture

From an architecture point of view, we are going to connect CAI with a Docker container that runs all the code for Automatic Speech Recognition (ASR). The container also holds the settings for a specific Telegram bot and the chat ID (group or personal) it works with.
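The flow inside the container can be sketched as a small pipeline: a voice note is transcribed, the transcript goes to CAI, and CAI's text replies go back to Telegram. In this minimal sketch, `transcribe` and `ask_cai` are hypothetical stand-ins for the real ASR model and the CAI call, passed in as plain functions so the wiring can be read (and tested) on its own.

```python
from typing import Callable, Dict, List

# Assumed shape of a CAI reply message: {"type": "text", "content": "..."}
Message = Dict[str, str]


def handle_voice(
    audio_path: str,
    transcribe: Callable[[str], str],
    ask_cai: Callable[[str], List[Message]],
) -> List[str]:
    """Voice note -> ASR transcript -> CAI -> list of text replies for Telegram."""
    transcript = transcribe(audio_path)  # ASR step (wav2vec2 in this post)
    messages = ask_cai(transcript)       # intent recognition and reply in CAI
    # Only plain-text messages are forwarded; other message types are ignored
    return [m["content"] for m in messages if m.get("type") == "text"]


if __name__ == "__main__":
    # Hypothetical stubs standing in for the real model and the CAI endpoint
    fake_asr = lambda path: "helo how are you"
    fake_cai = lambda text: [
        {"type": "text", "content": "I'm fine, thanks!"},
        {"type": "picture", "content": "https://example.com/cat.png"},
    ]
    print(handle_voice("voice.ogg", fake_asr, fake_cai))
```

Keeping ASR and CAI behind two functions is also what makes swapping the ASR engine (for example for NeMo) a local change.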

So, the picture will look like this:

The file structure is:

cai.py – interaction with CAI
voicebot.py – ASR and main logic
Dockerfile – instruction for container build

As you have probably guessed, the code is written in Python ;)

Automatic Speech Recognition

There are many ASR engines available today. We will use the transformers library from Hugging Face. You can find the full list of available models for your language here:

It is also quite easy to replace this model with NVIDIA NeMo.

You can find relevant tutorials here.

None of the code here is production ready; these are just examples!

To put the idea into practice, let’s create a folder with three files: cai.py, voicebot.py and Dockerfile.

cai.py

from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session
import uuid
import requests
import os

class CAI:
    # Credentials come from environment variables (passed with `docker run -e ...`)
    oAuthClientID = os.environ['oAuthClientID']
    oAuthClientSecret = os.environ['oAuthClientSecret']
    CAIreqToken = os.environ['CAIreqToken']

    def __init__(self):
        self.oAuthURL = 'https://sapcai-community.authentication.eu10.hana.ondemand.com/oauth/token'
        self.dialogURL = 'https://api.cai.tools.sap/build/v1/dialog'
        # Fetch the OAuth bearer token once at start-up
        self.token = self._get_bearer()

    def _get_bearer(self):
        # OAuth2 client-credentials flow: exchange client ID/secret for an access token
        client = BackendApplicationClient(client_id=self.oAuthClientID)
        oauth = OAuth2Session(client=client)
        token = oauth.fetch_token(token_url=self.oAuthURL, client_id=self.oAuthClientID,
                client_secret=self.oAuthClientSecret)
        return token['access_token']

    def get_response(self, text):
        # A new conversation_id per call: every message starts a fresh conversation
        dialogPayload = {"message": {"type": "text", "content": text},
                         "conversation_id": str(uuid.uuid1())}
        dialogHeaders = {
                "Authorization": "Bearer " + self.token,
                "X-Token": "Token " + self.CAIreqToken,
                "Content-Type": "application/json"
            }
        dialogResponse = requests.post(self.dialogURL, json=dialogPayload, headers=dialogHeaders)
        # Raise on HTTP errors instead of silently returning None
        dialogResponse.raise_for_status()
        # List of reply messages, e.g. [{"type": "text", "content": "..."}]
        return dialogResponse.json()['results']['messages']
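To make the request and response shape of `get_response` visible without network access, here is a minimal sketch that only builds the dialog payload and headers and extracts the text replies from a response body. The helper names are hypothetical; the field names (`message`, `conversation_id`, `X-Token`, `results.messages`) are the ones cai.py uses, and the sample response body is made up for illustration.

```python
import json
import uuid
from typing import Any, Dict, List, Tuple


def build_dialog_request(text: str, bearer: str, request_token: str,
                         conversation_id: str = "") -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (payload, headers) for a POST to the CAI dialog endpoint, as in cai.py."""
    payload = {
        "message": {"type": "text", "content": text},
        # cai.py generates a new ID per call; passing a fixed one keeps a single conversation
        "conversation_id": conversation_id or str(uuid.uuid1()),
    }
    headers = {
        "Authorization": "Bearer " + bearer,   # OAuth access token
        "X-Token": "Token " + request_token,   # bot request token
        "Content-Type": "application/json",
    }
    return payload, headers


def text_replies(response_body: str) -> List[str]:
    """Extract the text contents from a dialog response body (results.messages)."""
    messages = json.loads(response_body)["results"]["messages"]
    return [m["content"] for m in messages if m["type"] == "text"]


if __name__ == "__main__":
    payload, headers = build_dialog_request("hello", "abc", "xyz", conversation_id="demo-1")
    print(json.dumps(payload), headers["X-Token"])
    # Made-up sample body with the shape that get_response() reads
    sample = '{"results": {"messages": [{"type": "text", "content": "Hi there!"}]}}'
    print(text_replies(sample))
```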

voicebot.py

from telegram.ext import Updater, MessageHandler, Filters, CommandHandler
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import logging
from cai import CAI
import os

logging.basicConfig(level=logging.INFO)

# Telegram bot token and the chat ID (group or personal) the bot answers to
config = {
    'API_KEY': os.environ['API_KEY'],
    'id': [int(os.environ['id'])],
}

LANG_ID = "en"  # set to "ru" for the Russian model
if LANG_ID == 'ru':
    MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-russian"
else:
    MODEL_ID = "facebook/wav2vec2-base-960h"
# Load the processor (feature extractor + tokenizer) and the CTC model once at start-up
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def get_preds(OUTFILE):
    """Transcribe an audio file; returns a list with one decoded string."""
    speech_array, sampling_rate = torchaudio.load(OUTFILE)
    # wav2vec2 expects 16 kHz mono audio: average the channels, then resample
    # from the file's actual rate (Telegram voice notes are typically 48 kHz)
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    speech = resampler(speech_array.mean(dim=0)).numpy()

    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        if LANG_ID == 'ru':
            logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
        else:
            # wav2vec2-base models should not be given an attention mask
            logits = model(inputs.input_values).logits

    # Greedy CTC decoding: most likely token per frame, then collapse into text
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)

c = CAI()

def reply_with_cai(update, text):
    """Send the text to CAI and relay its text replies back to the chat."""
    for msg in c.get_response(text):
        if msg['type'] == 'text':
            update.message.reply_text(msg['content'])

def voice_handler(update, context):
    # Download the voice note (OGG/Opus); download() returns the local file path
    file_handler = context.bot.getFile(update.message.voice.file_id)
    file = file_handler.download('./voice.ogg')
    try:
        text = get_preds(file)[0]
        logging.info(f'The text - {text}')
        reply_with_cai(update, text)
    except Exception:
        logging.exception('Voice message processing failed')
        update.message.reply_text('Sorry!')

def text_handler(update, context):
    reply_with_cai(update, update.message.text)

def help_command(update, context):
    update.message.reply_text('Help!')

def main() -> None:
    """Run the bot."""
    logging.info('Ready!')
    # Create the Updater and pass it your bot's token.
    updater = Updater(config['API_KEY'])

    # Get the dispatcher to register handlers
    dispatcher = updater.dispatcher

    # Only answer in the configured chat (group or personal ID)
    allowed_chat = Filters.chat(chat_id=config['id'])
    dispatcher.add_handler(CommandHandler("help", help_command, filters=allowed_chat))
    dispatcher.add_handler(MessageHandler(Filters.voice & allowed_chat, voice_handler))
    # Exclude commands so that /help is not swallowed by the text handler
    dispatcher.add_handler(MessageHandler(Filters.text & ~Filters.command & allowed_chat, text_handler))

    updater.start_polling()
    updater.idle()


if __name__ == '__main__':
    main()
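`get_preds` ends with `torch.argmax` followed by `processor.batch_decode`. wav2vec2 models are trained with CTC, so the argmax gives one token per audio frame, and decoding collapses repeated tokens and drops the blank token. The pure-Python sketch below illustrates that greedy CTC step on a made-up toy vocabulary; it is a simplified illustration, not the tokenizer's actual implementation. It assumes, as in the wav2vec2 tokenizers, that the padding token acts as the blank and `|` separates words.

```python
from typing import List, Sequence

BLANK = "<pad>"       # padding token used as the CTC blank
WORD_DELIMITER = "|"  # word separator, decoded as a space


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]], vocab: List[str]) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    # 1. Most likely token index for each frame (what torch.argmax(logits, dim=-1) does)
    ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. Collapse consecutive duplicates: "H H E" -> "H E"
    collapsed = [i for k, i in enumerate(ids) if k == 0 or i != ids[k - 1]]
    # 3. Remove blanks and turn word delimiters into spaces
    chars = [vocab[i] for i in collapsed if vocab[i] != BLANK]
    return "".join(" " if ch == WORD_DELIMITER else ch for ch in chars).strip()


if __name__ == "__main__":
    vocab = [BLANK, WORD_DELIMITER, "E", "H", "I", "L", "O"]

    def one_hot(token: str) -> List[float]:
        # Toy per-frame scores: 1.0 for the given token, 0.0 elsewhere
        return [1.0 if t == token else 0.0 for t in vocab]

    # The blank between the two "L" frames keeps the double letter in "HELLO"
    frames = [one_hot(t) for t in ["H", "H", "E", "L", "L", BLANK, "L", "O", "|", "|", "H", "I"]]
    print(greedy_ctc_decode(frames, vocab))
```

This is also why a misheard frame usually produces a near-miss spelling rather than a different word, which is exactly the kind of error CAI's intent matching can absorb.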

Dockerfile

FROM pytorch/pytorch:latest
COPY ./cai.py cai.py
COPY ./voicebot.py voicebot.py
# voicebot.py uses the python-telegram-bot v13 API (Updater/Filters), so pin that version
RUN pip3 install torchaudio "python-telegram-bot==13.15" transformers oauthlib requests-oauthlib
CMD [ "python3", "voicebot.py"]

Instructions to start

Once all the files are prepared, the first step is to build the Docker image:

> docker build -t cai .

Next, we need some keys.

From CAI we need the client ID, client secret and request token; you can find all the relevant information in this nice blog post.
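The client ID and secret are exchanged for a bearer token in `_get_bearer` using the OAuth 2.0 client-credentials grant. To show what such a request contains, the stdlib-only sketch below builds (but does not send) an equivalent token request, sending the credentials as HTTP Basic authentication, a common way for OAuth clients to authenticate. The function name is hypothetical, and requests-oauthlib may format its request slightly differently depending on its version, so treat this as an illustration only.

```python
import base64
import urllib.parse
import urllib.request

TOKEN_URL = "https://sapcai-community.authentication.eu10.hana.ondemand.com/oauth/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (without sending) an OAuth 2.0 client-credentials token request."""
    # Client credentials go into an HTTP Basic Authorization header
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": "Basic " + basic,
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_token_request("my-client-id", "my-secret")
    print(req.get_method(), req.full_url)
    print(req.data)
    # Sending it would be urllib.request.urlopen(req); the JSON answer holds "access_token"
```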

We also need a Telegram bot token and a group or personal chat ID. I hope you can find these yourself; if not, don’t hesitate to ask.

Now we can run the bot locally with the following command (just replace the values with yours):

> docker run -d --name cairun -e oAuthClientID='YOUR CAI CLIENT ID' -e oAuthClientSecret='YOUR CAI CLIENT SECRET' -e CAIreqToken='YOUR CAI TOKEN' -e API_KEY='TELEGRAM BOT KEY' -e id='YOUR TELEGRAM ID' cai
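Both cai.py and voicebot.py read these five values from `os.environ` at import time, so a missing `-e` flag only shows up as a `KeyError` in the output of the detached container (visible with `docker logs cairun`). A small check like the hypothetical `missing_env_vars` helper below (not part of the original code) could be called at the top of voicebot.py, passing it `os.environ`, to report every missing or malformed variable at once.

```python
import os
from typing import List, Mapping

# The variables passed with -e in the docker run command above
REQUIRED_VARS = ["oAuthClientID", "oAuthClientSecret", "CAIreqToken", "API_KEY", "id"]


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset, empty or (for `id`) not an integer."""
    problems = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if "id" not in problems:
        try:
            # voicebot.py does int(os.environ['id']); group chat IDs are negative numbers
            int(env["id"])
        except ValueError:
            problems.append("id")
    return problems


if __name__ == "__main__":
    # Example with an incomplete environment; in voicebot.py you would pass os.environ
    print(missing_env_vars({"API_KEY": "123:abc", "id": "me"}))
```

If the returned list is non-empty, the bot can stop with a clear message (for example via `sys.exit`) instead of a stack trace.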

After that, you can try it out.

My native language is Russian, so my own bot speaks Russian. This one speaks English, using the wav2vec 2.0 model from Facebook.
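In voicebot.py the language is switched by editing `LANG_ID` in the source. A hypothetical alternative, sketched below, is to read the language from an environment variable (using `LANG_ID` as an env var name is an assumption, not part of the original setup) and look the model up in a table, so the same image could serve both languages via `docker run -e LANG_ID=ru ...`. The two model IDs are the ones used in the post.

```python
import os
from typing import Mapping

# Model IDs from voicebot.py; further Hugging Face wav2vec2 checkpoints could be added here
MODELS = {
    "en": "facebook/wav2vec2-base-960h",
    "ru": "jonatasgrosman/wav2vec2-large-xlsr-53-russian",
}


def pick_model(env: Mapping[str, str] = os.environ, default: str = "en") -> str:
    """Return the model ID for the (hypothetical) LANG_ID env variable, defaulting to English."""
    lang = env.get("LANG_ID", default).lower()
    if lang not in MODELS:
        raise ValueError(f"Unsupported language {lang!r}; expected one of {sorted(MODELS)}")
    return MODELS[lang]


if __name__ == "__main__":
    print(pick_model({"LANG_ID": "ru"}))
    print(pick_model({}))
```

Note that voicebot.py also uses `LANG_ID` to decide whether to pass an attention mask, so that check would have to read the same value.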

Happy voice-botting

As the next step, we will push this container to the Kyma runtime to make it available as a service.
