DeepMind Introduces ‘EATS’ – An Adversarial, End-to-End Approach to TTS

DeepMind wowed the research community several years ago by defeating grandmasters in the ancient game of Go, and more recently saw its self-taught agents thrash pros in the video game StarCraft II. Now, the UK-based AI company has delivered another impressive innovation, this time in text-to-speech (TTS).

Text-to-speech (TTS) systems take natural language text as input and produce synthetic human-like speech as output. These synthesis pipelines are complex, comprising multiple processing stages such as text normalisation, aligned linguistic featurisation, mel-spectrogram synthesis and raw audio waveform synthesis.

Although contemporary TTS systems, such as those used in digital assistants like Siri, boast high-fidelity speech synthesis and wide real-world deployment, even the best of them still have drawbacks. Each stage requires expensive “ground truth” annotations to supervise its outputs, and the systems cannot be trained directly from characters or phonemes to synthesize speech in the end-to-end manner increasingly favoured in other machine learning domains.

To address these issues, DeepMind researchers have developed EATS, a generative model trained adversarially in an end-to-end manner that achieves performance comparable to state-of-the-art (SOTA) models that rely on multi-stage training and additional supervision.

EATS (End-to-end Adversarial TTS) is tasked with mapping an input sequence of characters or phonemes to raw audio at 24 kHz. A critical real-world challenge is that the input text and output speech signals generally have very different lengths and are not aligned. EATS deals with this via two high-level submodules: an aligner, which predicts the duration of each input token and produces an audio-aligned representation, and a decoder, which upsamples the aligner’s output to the full audio sampling rate.
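To make the aligner idea more concrete, here is a minimal sketch of duration-based alignment in plain Python. It is not DeepMind's implementation: the function names, the Gaussian-style weighting, and the `temperature` value are assumptions chosen for illustration. Given a predicted length for each token, it computes where each token is centred on the output time axis, then builds each output frame as a softly weighted average of the token features.

```python
import math


def token_centers(lengths):
    """Centre position of each token, given predicted token lengths in output frames."""
    centers, end = [], 0.0
    for length in lengths:
        end += length                      # running end position of this token
        centers.append(end - length / 2.0)
    return centers


def align(token_features, lengths, num_frames, temperature=10.0):
    """Interpolate per-token features onto an output time grid.

    Each output frame is a softmax-weighted average of the token features,
    with weights that decay with squared distance from each token centre.
    The weighting scheme and temperature are illustrative assumptions.
    """
    centers = token_centers(lengths)
    dim = len(token_features[0])
    aligned = []
    for t in range(num_frames):
        logits = [-((t - c) ** 2) / temperature for c in centers]
        peak = max(logits)                 # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        frame = [
            sum(w * f[d] for w, f in zip(weights, token_features)) / total
            for d in range(dim)
        ]
        aligned.append(frame)
    return aligned


# Hypothetical example: two tokens with one-hot features and predicted lengths 3 and 5.
features = [[1.0, 0.0], [0.0, 1.0]]
frames = align(features, lengths=[3.0, 5.0], num_frames=8)
for t, frame in enumerate(frames):
    print(t, [round(v, 3) for v in frame])
```

Because the weights vary smoothly with the predicted lengths, a soft interpolation of this kind is one way an alignment step can stay differentiable, which fits the article's point that the generator is trained end-to-end. A decoder would then upsample the aligned frames to the audio rate.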

Noteworthy points of the EATS model include:

The entire generator architecture is differentiable, and is trained end-to-end.
It is a feed-forward convolutional neural network, which makes it suitable for applications where fast batched inference is important.
The adversarial approach enables the generator to learn from a relatively weak supervisory signal, significantly reducing the cost of annotations.
It does not rely on autoregressive sampling or teacher forcing, avoiding issues like exposure bias and reduced parallelism at inference time, which makes it efficient in both training and inference.

Researchers evaluated EATS using Mean Opinion Score (MOS) to measure speech quality. In the tests, all models were trained on datasets of speech recorded by professional voice actors, paired with the corresponding text. The voice pool comprised 69 North American English speakers.
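For readers unfamiliar with MOS, the arithmetic is simple: listeners rate samples on a numeric scale (commonly 1 to 5), and the score is the mean of those ratings, often reported with a confidence interval. The sketch below uses made-up ratings and a normal-approximation interval; the function name and the data are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for a list of listener ratings.

    Uses a normal approximation for the interval, which is an assumption
    made here for simplicity.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = mean(ratings)
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners on a 1-to-5 scale.
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.3f} +/- {half_width:.3f}")
```

Real evaluations typically collect far more ratings than this, so that small differences between systems can be distinguished from noise.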

Compared to previous models, EATS requires substantially less supervision but still achieves an MOS of 4.083, approaching the level of SOTA methods like GAN-TTS and WaveNet. It also scores substantially better than ablated variants of EATS itself, such as those labelled No RWDs, No MelSpecD, and No Discriminators, in which the named discriminator components are removed.

The paper End-to-End Adversarial Text-to-Speech is on arXiv.

Author: Hecate He | Editor: Michael Sarazen & Yuan Yuan
