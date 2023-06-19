What’s the news: On June 16, 2023, Meta announced ‘Voicebox’, a generative AI model that can carry out speech-generation tasks “it was not specifically trained” for. Although there have been reports of Mozilla working on an open-source Common Voice project in the past, Meta claims that Voicebox is the first speech model of its kind.

What can Voicebox do?

The AI model can produce audio clips, synthesize speech across six languages, and perform noise removal, content editing, style conversion, and diverse sample generation. These functions can be explained with the following use cases:

In-context text-to-speech synthesis: Given an input audio sample as short as two seconds, Voicebox can match the sample’s audio style and use it for text-to-speech generation. According to Meta, this could ‘bring speech’ to people who are unable to speak and be used to customize the voices of non-player characters and virtual assistants.

Cross-lingual style transfer: The model can help people who speak different languages communicate with each other, provided it is given a sample of speech and the text to be spoken in the target language. So far, Voicebox has been trained on English, French, German, Spanish, Polish and Portuguese.

Speech denoising and editing: Voicebox’s in-context learning can edit segments within audio recordings.

“It can resynthesize the portion of speech corrupted by short-duration noise or replace misspoken words without having to rerecord the entire speech,” said Meta.

Diverse speech sampling: “Voicebox can generate speech that is more representative of how people talk in the real world and across the six languages listed above. In the future, this capability could be used to generate synthetic data to help better train a speech assistant model,” said Meta.

Meta said that speech recognition models trained on Voicebox-generated synthetic speech perform almost as well as models trained on real speech. Their error rate degrades by one percent, whereas models trained on synthetic speech from previous text-to-speech models showed 45 to 70 percent error rate degradation.
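To make the "error rate degradation" figure concrete, here is a minimal sketch of how such a number could be computed. It assumes the degradation is a relative increase in word error rate (WER) over a model trained on real speech; Meta's announcement does not spell out the exact definition. The function name and the WER values are hypothetical and chosen only for illustration.

```python
def relative_degradation(baseline_wer: float, synthetic_wer: float) -> float:
    """Relative increase in error rate, as a percentage of the baseline.

    Assumption: "degradation" means relative change. The source article
    does not confirm this definition.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (synthetic_wer - baseline_wer) / baseline_wer * 100.0


# Hypothetical WER values (in percent), not figures reported by Meta.
real_speech_wer = 5.0   # recognizer trained on real speech
synthetic_wer = 5.05    # recognizer trained on synthetic speech

print(f"{relative_degradation(real_speech_wer, synthetic_wer):.1f}% degradation")
```

Under this reading, a recognizer whose WER rises from 5.0 to 5.05 has degraded by about one percent, while a rise from 5.0 to 7.5 would be a 50 percent degradation.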

Why it matters: While Mozilla’s Common Voice looked to create an open audio bank for easier editing and innovative voice-enabled technologies, Meta’s researchers have taken a more ambitious approach with this new AI model. However, whether they are making these strides with basic privacy and data rights in mind remains to be seen. The past year has seen many new technologies emerge, such as ChatGPT, yet very little is known about the characteristics of the underlying models. So, it will be interesting to see how Meta handles this model in the future while trying to adhere to the ethical standards around generative AI systems.

Voicebox code not publicly available: While Meta’s researchers said it is important to be open with the AI community, they have decided not to make the Voicebox model or code publicly available at this time.

Classifier to distinguish real and synthetic speech: Researchers working on the AI model said that they recognized the model’s “potential for misuse and unintended harm.” Their research paper therefore describes a classifier that can distinguish between authentic speech and audio generated with Voicebox.

