This new AI can mimic human voices with only 3 seconds of training

Humanity has taken another step towards its inevitable war with the machines (which we will lose) with the creation of Vall-E, an AI developed by a team of researchers at Microsoft that can generate high-quality reproductions of a human voice from just a few seconds of audio training.

Vall-E is not the first AI-powered voice tool (xVASynth, for example, has been around for several years), but it promises to surpass them all in terms of pure capability. In a paper available from Cornell University (via Windows Central), the Vall-E researchers say that most current text-to-speech systems are limited because they rely on “high-quality clean data” to accurately synthesize high-quality speech.

“Large-scale data crawled from the Internet cannot meet the requirement, and always lead to performance degradation,” the paper states. “Because the training data is relatively small, current TTS systems still suffer from poor generalization. Speaker similarity and speech naturalness decline dramatically for unseen speakers in the zero-shot scenario.”

(“Zero-shot scenario” in this case essentially means the ability of the AI to reproduce voices without being specifically trained on them.)

Vall-E, on the other hand, is trained on a much larger and more diverse dataset: 60,000 hours of English speech drawn from over 7,000 unique speakers, all transcribed by speech recognition software. The data fed to the AI contains “more noisy speech and inaccurate transcriptions” than that used by other text-to-speech systems, but the researchers believe that the scale of the input and its diversity make it much more flexible, adaptable, and (this is the big one) natural than its predecessors.

“Experimental results show that Vall-E significantly outperforms state-of-the-art zero-shot TTS systems in terms of speech naturalness and speaker similarity,” states the paper, which is filled with numbers, equations, diagrams, and other such complexities. “Furthermore, we found that Vall-E can preserve the speaker’s emotions and the acoustic environment of the voice prompts in the synthesis.”

(Image credit: Vall-E)

You can actually hear Vall-E in action on GitHub, where the research team shares a quick breakdown of how it all works, as well as dozens of sample inputs and outputs. Quality varies: some voices sound noticeably robotic, while others are very human. But as a sort of first-pass tech demo, it’s impressive. Imagine what this technology will look like in one, two, or five years as the system improves and the voice training dataset grows even larger.

Of course, that is also why it matters. Dall-E, the AI art generator, faces backlash over privacy and ownership concerns, and the ChatGPT bot is convincing enough that it was recently banned by the New York City Department of Education. Vall-E could be even more concerning, as it could be used to bolster fraudulent marketing calls and deepfake videos. That may sound a little alarmist, but as Editor-in-Chief Tyler Wilde said earlier this year, things like this aren’t going away, and it is important to recognize potential problems and regulate the creation and use of AI systems before they become real (and really big) problems.

The Vall-E research team addressed these “broader impacts” in the conclusion of its paper. “Because Vall-E can synthesize speech that preserves the identity of the speaker, it can carry potential risks of model misuse, such as spoofing voice identification or impersonating a specific speaker,” the team wrote. “To mitigate such risks, it is possible to build a detection model that identifies whether an audio clip was synthesized by Vall-E. We will also put Microsoft AI Principles into practice when further developing the models.”

If you want more evidence that on-the-fly voice imitation leads to bad places:
