VALL-E: Microsoft’s new zero-shot text-to-speech model can duplicate anyone’s voice from a three-second sample

by Damir Yalalov

Published: Jan 08, 2023 at 3:30 am Updated: Jan 08, 2023 at 3:30 am

In Brief

With just a three-second sample of any voice, the transformer-based TTS model VALL-E can produce speech in that voice.

This is a significant advancement in the direction of more natural-sounding TTS systems.

Microsoft has not released the code, but it has provided a few samples of the model in use, and it is evident that this represents a significant development in TTS technology.

Since the release of the first text-to-speech (TTS) model, researchers have been looking for ways to improve the way these systems generate speech. The latest model from Microsoft, VALL-E, is a significant step forward in this regard.

VALL-E is a transformer-based TTS model that can generate speech in any voice after hearing only a three-second sample of that voice. This is a significant improvement over previous models, which typically needed far more recorded speech from a speaker, and often additional training, before they could reproduce a new voice.

VALL-E is an amazing technological feat that has the potential to change the way we interact with digital media.

Additionally, the intonation, character, and style of the sampled voice are all preserved in the generated speech. This is an important step forward in making TTS systems sound more natural.

This model is transformer-based and resembles DALL-E 1 in design; it should not be confused with the diffusion-based DALL-E 2. The code has not been released, and some users are skeptical that Microsoft will publish it. However, Microsoft has released a few examples of the model in action, and it is clear that this is a major advance in TTS technology.
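For readers curious what "transformer-based, in the style of DALL-E 1" might mean in practice, here is a minimal toy sketch of the general idea: treat speech as a sequence of discrete tokens and generate it one token at a time, conditioned on the text and on tokens from the short voice sample. This is not Microsoft's code, which has not been released. The function names, the vocabulary size, the token rate, and the stand-in "model" are all assumptions made up for illustration.

```python
# Toy illustration only: every name and number here is an assumption,
# not part of any released VALL-E implementation.

VOCAB_SIZE = 1024         # assumed number of distinct audio tokens
TOKENS_PER_SECOND = 75    # assumed audio-token rate
PROMPT_SECONDS = 3        # the three-second voice sample from the article


def toy_next_token(context):
    """Stand-in for a trained transformer.

    A real model would score every possible token given the full context.
    This toy simply derives the next token from the last three tokens so
    that the output visibly depends on what came before.
    """
    return (sum(context[-3:]) * 31 + 7) % VOCAB_SIZE


def generate_speech_tokens(text_tokens, prompt_audio_tokens, max_new_tokens=300):
    """Autoregressively extend the sequence: text + voice prompt -> new audio tokens."""
    context = list(text_tokens) + list(prompt_audio_tokens)
    generated = []
    for _ in range(max_new_tokens):
        token = toy_next_token(context)
        context.append(token)     # each new token becomes part of the context
        generated.append(token)
    return generated


if __name__ == "__main__":
    # Hypothetical inputs: a few text tokens and a three-second voice prompt.
    text = [12, 845, 3]
    prompt = [i % VOCAB_SIZE for i in range(PROMPT_SECONDS * TOKENS_PER_SECOND)]
    new_tokens = generate_speech_tokens(text, prompt, max_new_tokens=10)
    print(len(prompt), "prompt tokens ->", len(new_tokens), "generated tokens")
```

In a real system, the generated tokens would then be converted back into a waveform by a separate audio decoder; that step is omitted here.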

Damir Yalalov

Damir is the Editor/SEO/Product Lead at mpost.io. He is most interested in SecureTech, Blockchain, and FinTech startups. Damir earned a bachelor's degree in physics.
