The experience of using Google Cloud’s Text-to-Speech AI

source link: https://donghao.org/2023/08/11/the-experience-of-using-google-clouds-text-to-speech-ai/

I tried using the Python API of Text-to-Speech AI to convert a PDF file into MP3 audio, as in this example:

from google.cloud import texttospeech
from PyPDF2 import PdfReader

client = texttospeech.TextToSpeechClient()

reader = PdfReader("xxx.pdf")

# Mandarin Chinese WaveNet voice
voice = texttospeech.VoiceSelectionParams(
    language_code="cmn-CN", name="cmn-CN-Wavenet-B", ssml_gender=texttospeech.SsmlVoiceGender.MALE
)

# MP3 output, spoken slightly slower than the default rate
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=0.8,
)

text = ""
# try first 10 pages
for page in reader.pages[:10]:
    text += page.extract_text()

# number of characters (not bytes) extracted
print(len(text))
synthesis_input = texttospeech.SynthesisInput(text=text)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
    print("Written")

Very simple, right? But it reported an error:

google.api_core.exceptions.InvalidArgument: 400 Either `input.text` or `input.ssml` is longer than the limit of 5000 bytes. This limit is different from quotas. To fix, reduce the byte length of the characters in this request, or consider using the Long Audio API: https://cloud.google.com/text-to-speech/docs/create-audio-text-long-audio-synthesis.
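The limit in this message is counted in bytes, whereas the `len(text)` printed by the script counts characters. Here is a small check of how far apart the two can be for Chinese text. It assumes the API measures the UTF-8 encoding of the input, which the message does not say explicitly, and `utf8_size` is just a helper name made up for this illustration:

```python
def utf8_size(text: str) -> int:
    """Return the length of `text` in bytes when encoded as UTF-8 (assumed encoding)."""
    return len(text.encode("utf-8"))


sample = "你好，世界"  # 5 characters, each taking 3 bytes in UTF-8

print(len(sample), utf8_size(sample))  # prints: 5 15
```

Under that assumption, about 1,666 Chinese characters are already enough to hit the 5000-byte limit.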

It seems the request is too long. Let’s use the “Long Audio API”:

from google.cloud import texttospeech
from PyPDF2 import PdfReader

client = texttospeech.TextToSpeechLongAudioSynthesizeClient()

reader = PdfReader("xxx.pdf")

voice = texttospeech.VoiceSelectionParams(
    language_code="cmn-CN", name="cmn-CN-Wavenet-B", ssml_gender=texttospeech.SsmlVoiceGender.MALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
    speaking_rate=0.8,
)

text = ""
# try first 10 pages
for page in reader.pages[:10]:
    text += page.extract_text()

print(len(text))
synthesis_input = texttospeech.SynthesisInput(text=text)
# The requested encoding is LINEAR16, so the output object is named .wav rather than .mp3
request = texttospeech.SynthesizeLongAudioRequest(
    parent="projects/robin-00000/locations/us",
    input=synthesis_input, voice=voice, audio_config=audio_config,
    output_gcs_uri="gs://robin_tts/xxx.wav"
)

# Long-running operation: wait up to 300 seconds for the result
operation = client.synthesize_long_audio(request=request)
result = operation.result(timeout=300)
print(result)

It still didn’t work:

google.api_core.exceptions.InvalidArgument: 400 The long audio API does not support the language zh. Supported languages: en, es.

Okay, it doesn’t support Chinese. Then what should I do if I want to convert a Chinese PDF to MP3? Convert it page by page into 500 MP3 files? That would be terrible. And even the short MP3 it did generate definitely sounds like a machine, not a human.
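A possible workaround for the 500-file scenario, sketched here and not tested against the real service, is to cut the text into pieces that fit the 5000-byte limit, synthesize each piece with the standard API, and append all the audio to a single file. The names `split_by_bytes` and `synthesize_to_file` are hypothetical helpers, the UTF-8 byte counting is an assumption, and so is the idea that players will accept MP3 segments that are simply concatenated. The split is also naive: it cuts at any character, while a nicer version would cut at sentence ends such as "。" so the voice does not break mid-sentence.

```python
from typing import Callable, Iterator

MAX_BYTES = 5000  # limit quoted in the API error message


def split_by_bytes(text: str, limit: int = MAX_BYTES) -> Iterator[str]:
    """Yield pieces of `text` whose UTF-8 encoding is at most `limit` bytes."""
    chunk, size = [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        # Close the current piece before it would exceed the limit
        if chunk and size + n > limit:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(ch)
        size += n
    if chunk:
        yield "".join(chunk)


def synthesize_to_file(text: str, synthesize: Callable[[str], bytes], path: str) -> int:
    """Write the audio of every piece to one file and return the number of pieces."""
    count = 0
    with open(path, "wb") as out:
        for piece in split_by_bytes(text):
            out.write(synthesize(piece))
            count += 1
    return count


# Hypothetical wiring to the client, voice and audio_config of the first example:
#
# def synthesize(piece: str) -> bytes:
#     response = client.synthesize_speech(
#         input=texttospeech.SynthesisInput(text=piece),
#         voice=voice, audio_config=audio_config,
#     )
#     return response.audio_content
#
# synthesize_to_file(text, synthesize, "output.mp3")
```

Passing the synthesis step in as a function keeps the splitting logic testable without any cloud credentials. It does nothing about the machine-like voice, of course.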

Google has state-of-the-art deep learning technology, but some of its cloud products are ridiculously hard to use (such as Vertex AI and this Text-to-Speech).

After some searching (at least Google Search is as good as ever), I found NaturalReader. Surprisingly, it supports Chinese, and the voice sounds as good as a real human. The only problem is that it is very expensive for individual users.

August 11, 2023 - 0:33 RobinDong industry
python