
[2204.11792] SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

source link: https://arxiv.org/abs/2204.11792

[Submitted on 25 Apr 2022]



Recent progress in non-autoregressive text-to-speech (NAR-TTS) has made fast and high-quality speech synthesis possible. However, current NAR-TTS models usually take a phoneme sequence as input and thus cannot capture the tree-structured syntactic information of the input sentence, which hurts prosody modeling. To this end, we propose SyntaSpeech, a syntax-aware and lightweight NAR-TTS model that integrates tree-structured syntactic information into the prosody modeling modules of PortaSpeech (Ren et al., 2021). Specifically, 1) we build a syntactic graph based on the dependency tree of the input sentence, then process the text encoding with a syntactic graph encoder to extract the syntactic information; 2) we incorporate the extracted syntactic encoding into PortaSpeech to improve prosody prediction; 3) we introduce a multi-length discriminator to replace the flow-based post-net in PortaSpeech, which simplifies the training pipeline and improves inference speed while preserving the naturalness of the generated audio. Experiments on three datasets show that the tree-structured syntactic information enables SyntaSpeech to synthesize better audio with expressive prosody, and demonstrate that SyntaSpeech generalizes to multiple languages and to multi-speaker text-to-speech. Ablation studies demonstrate the necessity of each component in SyntaSpeech. Source code and audio samples are available at this https URL
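
As a rough illustration of the first step in the abstract (building a syntactic graph from a dependency tree), the sketch below turns a list of dependency heads into forward and backward edge lists. It is only a minimal sketch under stated assumptions: the function name `build_syntactic_graph`, the `heads` input format, and the choice to emit one edge in each direction per dependency are hypothetical and are not taken from the paper or its released code, which may construct the graph differently.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def build_syntactic_graph(heads: List[int]) -> Tuple[List[Edge], List[Edge]]:
    """Build edge lists from a dependency tree (hypothetical helper).

    heads[i] is the index of word i's head word, or -1 if word i is the root.
    Returns (forward, backward): forward edges run head -> dependent,
    backward edges run dependent -> head.
    """
    forward: List[Edge] = []
    backward: List[Edge] = []
    for dep, head in enumerate(heads):
        if head < 0:
            continue  # the root has no incoming dependency edge
        if head >= len(heads) or head == dep:
            raise ValueError(f"invalid head {head} for word {dep}")
        forward.append((head, dep))
        backward.append((dep, head))
    return forward, backward


# Assumed toy parse of "The cat sat": "The" depends on "cat",
# "cat" depends on "sat", and "sat" is the root.
forward, backward = build_syntactic_graph([1, 2, -1])
print(forward)
print(backward)
```

Edge lists of this kind are a common input format for graph encoders; how the word-level graph is aligned with the phoneme-level text encoding is left out here because the abstract does not describe it.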

Comments: Accepted by IJCAI-2022. 12 pages
Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2204.11792 [cs.SD]
  (or arXiv:2204.11792v1 [cs.SD] for this version)
  https://doi.org/10.48550/arXiv.2204.11792
