The experience of using Google Cloud’s Text-to-Speech AI

I used the Python API of Text-to-Speech AI to convert a PDF file to MP3 audio, as in this example:

from google.cloud import texttospeech
from PyPDF2 import PdfReader

client = texttospeech.TextToSpeechClient()

reader = PdfReader("xxx.pdf")

voice = texttospeech.VoiceSelectionParams(
    language_code="cmn-CN", name="cmn-CN-Wavenet-B", ssml_gender=texttospeech.SsmlVoiceGender.MALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=0.8,
)

text = ""
# try first 10 pages
for page in reader.pages[:10]:
    text += page.extract_text()

# number of characters (not bytes) extracted
print(len(text))
synthesis_input = texttospeech.SynthesisInput(text=text)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# the response carries the encoded MP3 bytes
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
    print("Written")

Very simple, right? But it just reported an error:

google.api_core.exceptions.InvalidArgument: 400 Either `input.text` or `input.ssml` is longer than the limit of 5000 bytes. This limit is different from quotas. To fix, reduce the byte length of the characters in this request, or consider using the Long Audio API: https://cloud.google.com/text-to-speech/docs/create-audio-text-long-audio-synthesis.
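
Note that the limit is stated in bytes, while `print(len(text))` in the script counts characters. A minimal sketch of the difference, assuming the service measures the text as UTF-8 (the error message does not name the encoding); `utf8_len` is a hypothetical helper, not part of the Google library:

```python
def utf8_len(text: str) -> int:
    """Return the size of `text` in bytes when encoded as UTF-8 (assumed encoding)."""
    return len(text.encode("utf-8"))


sample = "你好，世界"
# Each of these Chinese characters (and the full-width comma) takes 3 bytes in UTF-8.
print(len(sample), utf8_len(sample))  # prints: 5 15
```

So a Chinese text can hit a 5000-byte limit at well under 2000 characters.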

It seems the request is too long. Let’s use the “Long Audio API”:

from google.cloud import texttospeech
from PyPDF2 import PdfReader

client = texttospeech.TextToSpeechLongAudioSynthesizeClient()

reader = PdfReader("xxx.pdf")

voice = texttospeech.VoiceSelectionParams(
    language_code="cmn-CN", name="cmn-CN-Wavenet-B", ssml_gender=texttospeech.SsmlVoiceGender.MALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    speaking_rate=0.8,
)

text = ""
# same first 10 pages as before
for page in reader.pages[:10]:
    text += page.extract_text()

print(len(text))
synthesis_input = texttospeech.SynthesisInput(text=text)
request = texttospeech.SynthesizeLongAudioRequest(
    parent="projects/robin-00000/locations/us",
    input=synthesis_input, voice=voice, audio_config=audio_config,
    output_gcs_uri="gs://robin_tts/xxx.mp3"
)

operation = client.synthesize_long_audio(request=request)
result = operation.result(timeout=300)
print(result)

It still didn't work:

google.api_core.exceptions.InvalidArgument: 400 The long audio API does not support the language zh. Supported languages: en, es.

Okay, it doesn't support Chinese. Then what should I do if I want to convert a Chinese PDF to MP3? Convert it page by page into 500 MP3 files? That is terrible. And even the short MP3 it did generate clearly sounds like a machine, not a human.
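
One possible way around a file per page is to split the text into pieces that fit the 5000-byte limit, synthesize each piece with the ordinary API, and join the results. The sketch below is only that, under a few assumptions: the limit is measured in UTF-8 bytes, MP3 data from separate requests can simply be joined end to end (most players seem to accept this, but it is worth checking), and `split_by_bytes` and `synthesize_all` are hypothetical helpers, not part of the Google library.

```python
import re
from typing import Callable, Iterator

MAX_BYTES = 5000  # limit quoted in the error message


def split_by_bytes(text: str, limit: int = MAX_BYTES) -> Iterator[str]:
    """Yield pieces of `text` whose UTF-8 encoding is at most `limit` bytes.

    Cuts after sentence-ending punctuation where possible and falls back to
    cutting between characters when a single sentence is too long.
    Assumes `limit` is at least 4, the largest size of one UTF-8 character.
    """
    # Zero-width split after Chinese/Western sentence enders or newlines,
    # so the punctuation stays attached to its sentence (Python 3.7+).
    sentences = re.split(r"(?<=[。！？.!?\n])", text)
    chunk = ""
    for sentence in sentences:
        if len((chunk + sentence).encode("utf-8")) <= limit:
            chunk += sentence
            continue
        if chunk:
            yield chunk
            chunk = ""
        # The sentence did not fit: add it character by character.
        for ch in sentence:
            if len((chunk + ch).encode("utf-8")) > limit:
                yield chunk
                chunk = ch
            else:
                chunk += ch
    if chunk:
        yield chunk


def synthesize_all(text: str, synthesize: Callable[[str], bytes]) -> bytes:
    """Run `synthesize` on every piece and join the returned audio bytes."""
    return b"".join(synthesize(piece) for piece in split_by_bytes(text))


# Hypothetical wiring to the client from the first script (not run here):
# def synthesize(piece: str) -> bytes:
#     return client.synthesize_speech(
#         input=texttospeech.SynthesisInput(text=piece),
#         voice=voice, audio_config=audio_config,
#     ).audio_content
# with open("output.mp3", "wb") as out:
#     out.write(synthesize_all(text, synthesize))
```

Passing the synthesis step in as a function keeps the splitting logic testable without calling the cloud service. Each piece still counts toward the usual quotas and billing, and this does nothing about how mechanical the voice sounds.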

Google has state-of-the-art deep learning technology, but some of its cloud products are ridiculously hard to use (such as Vertex AI and this Text-to-Speech).

After some searching (at least Google Search is as good as ever), I found NaturalReader. Surprisingly, it supports Chinese, and the voice sounds as good as a real human. The only problem is that it is very expensive for individual users.

August 11, 2023 - 0:33 RobinDong industry
python