Let’s face it, technology has always been about making the impossible possible. And in this case, we’re talking about an AI capable of generating more than just pretty prose and doodles. Enter Meta’s latest innovation, Voicebox. This nifty little number is the tech giant’s attempt to revolutionize the spoken word in the same way that ChatGPT and DALL-E have transformed text and image generation, respectively.
The Magic Behind the Microphone
Now, you may be wondering, “What exactly does this Voicebox do?” Well, in simple terms, it’s a generator for the spoken word, much like its text and image counterparts. But, instead of churning out paragraphs or pastel portraits, it produces audio clips. Described by Meta as a “non-autoregressive flow-matching model trained to infill speech, given audio context and text,” Voicebox has been trained on a whopping 50,000 hours of unfiltered audio.
The data used to train Voicebox is an international buffet of languages – English, French, Spanish, German, Polish, and Portuguese. It’s this diversity that gives the system its ability to generate speech that sounds like a conversation, no matter the language being spoken.
A New Era of Speech Generation
The researchers at Meta are quite pleased with the results, saying that speech recognition models trained on Voicebox-generated synthetic speech perform nearly as well as models trained on real speech. And the drop-off? Barely a blip on the radar: Meta reports an error rate degradation of just 1%, compared to the whopping 45 to 70% degradation seen with synthetic speech from existing text-to-speech (TTS) models.
The system’s learning process is also rather ingenious. It was initially taught to predict speech segments based on surrounding segments and the transcript of the passage. Once it mastered that, it was able to apply this knowledge to a variety of speech generation tasks.
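For the curious, here is a rough idea of what that "fill in the gap" training setup could look like. This is a minimal sketch, not Meta's code: the function name `make_infill_example`, the `span_len` parameter, and the use of plain integers as stand-ins for audio frames are all assumptions made for illustration. It hides one contiguous stretch of a clip and keeps the surrounding frames plus the transcript as the context a model would see.

```python
import random


def make_infill_example(frames, transcript, span_len=3, seed=0):
    """Hide one contiguous span of frames (hypothetical helper, not Meta's API).

    Returns (inputs, target): inputs hold the transcript, the frames with the
    hidden span replaced by None, and a boolean mask; target holds the hidden
    frames a model would be asked to predict.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    rng = random.Random(seed)
    n = len(frames)
    span = min(span_len, n)              # never hide more than the clip holds
    start = rng.randint(0, n - span)     # inclusive start index of the span
    mask = [start <= i < start + span for i in range(n)]
    context = [None if hidden else f for f, hidden in zip(frames, mask)]
    target = [f for f, hidden in zip(frames, mask) if hidden]
    inputs = {"transcript": transcript, "context": context, "mask": mask}
    return inputs, target


# Stand-in "audio": ten integer frames instead of real spectrogram frames.
inputs, target = make_infill_example(list(range(10)), "hello world")
print(inputs["context"], target)
```

A real system would work on spectrogram-like features and a learned model rather than lists of integers, but the shape of the task is the same: surrounding audio and text go in, the missing stretch comes out.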
More than Just a Talking Head
But Voicebox isn’t just good at talking; it’s also a pro at editing. It can actively edit audio clips, removing noise and even replacing mispronounced words. You know that annoying barking dog in the background of your conference call? Voicebox can help with that.
Now, while TTS generators are not a new concept (remember the days of receiving questionable driving directions in Morgan Freeman’s voice?), modern versions like Speechify or ElevenLabs’ Prime Voice AI require a plethora of source material to accurately mimic their subject. Voicebox, on the other hand, has a trick up its sleeve. Thanks to a novel zero-shot text-to-speech training method that Meta calls Flow Matching, it doesn’t need mountains of data.
Breaking Records and Taking Names
The results? Well, they’re not even close. Meta’s AI reportedly outperforms the current state of the art in both intelligibility and “audio similarity,” all while operating up to 20 times faster than today’s best TTS systems.
But hold your horses before you start dreaming of celebrity navigation voices. Neither the Voicebox app nor its source code is being released to the public at this time. Meta has cited potential misuse risks as the reason for this decision. Instead, the company has released a series of audio examples and the program’s initial research paper. Looking ahead, the research team hopes that this technology will find its way into prosthetics for patients with vocal cord damage and digital assistants.
Source: www.engadget.com