VOICECRAFT edits and generates speech in seconds

VOICECRAFT is an innovative AI tool for speech editing and zero-shot text-to-speech (TTS) generation. It can insert, delete, and replace words within a recording without requiring a new recording session. It can also generate speech from a text prompt in a target voice, even one it has never encountered before.

VOICECRAFT outperforms state-of-the-art models such as VALL-E and XTTS-v2. The model is open source: code, data, and model weights are available in the GitHub repository, which also links to the project page with audio samples, a Hugging Face Space, and a Colab notebook.

Speech editing with VOICECRAFT (source: paper)

What can VOICECRAFT do?

VOICECRAFT is a token infilling neural codec language model capable of:

  1. Speech editing: acts as an intelligent autocorrect for recordings with mistakes or unwanted sounds.
  2. Zero-shot TTS: produces speech from text in a target voice, given only a few seconds of reference audio, even if that voice was never seen during training (a sketch of both operations follows this list).
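
To make the two modes concrete, here is a minimal sketch of what calling such a model could look like. The `VoiceCraftModel` class and its `edit`/`tts` methods are hypothetical names for illustration, not the actual API of the GitHub repository.

```python
# Hypothetical wrapper around a VOICECRAFT-style model. The class and
# method names below are illustrative only, NOT the repository's API.

class VoiceCraftModel:
    def edit(self, audio_path: str, original_transcript: str,
             target_transcript: str) -> bytes:
        """Re-synthesize only the words that differ between the two
        transcripts; the rest of the recording is left untouched."""
        ...

    def tts(self, text: str, voice_prompt_path: str) -> bytes:
        """Zero-shot TTS: speak `text` in the voice of the short
        reference recording at `voice_prompt_path`."""
        ...


model = VoiceCraftModel()

# 1. Speech editing: swap one word without a new recording session.
edited = model.edit(
    audio_path="podcast_take1.wav",
    original_transcript="we ship the product on monday",
    target_transcript="we ship the product on friday",
)

# 2. Zero-shot TTS: clone a voice from a few seconds of reference audio.
generated = model.tts(
    text="Welcome back to the show.",
    voice_prompt_path="host_sample_3s.wav",
)
```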

It can be used for various applications and types of audio content, such as:

  • audiobooks: to edit or create spoken versions of books.
  • internet videos: to improve the sound quality or fill in missing parts of spoken content in online videos.
  • podcasts: to edit podcast episodes, removing mistakes or gaps in speech.

The model

VOICECRAFT is a neural codec language model (NCLM) built on the Transformer architecture. It casts both speech editing and TTS as ordinary left-to-right language modeling, which is made possible by a strategic rearrangement of the tokens produced by the neural codec.

The neural codec takes a speech signal (like your voice), encodes it into a compact sequence of discrete tokens, and can then decode those tokens back into a recognizable speech waveform.
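
As an illustration of this round trip, the snippet below tokenizes and reconstructs a waveform with Meta's open-source encodec package. One caveat: VOICECRAFT trains its own EnCodec variant at 16 kHz, while the public model used here runs at 24 kHz, so this is a stand-in rather than the paper's exact codec.

```python
# Sketch: encode speech into discrete codec tokens and decode it back,
# using Meta's public `encodec` package (pip install encodec).
# VOICECRAFT trains its own EnCodec at 16 kHz; the 24 kHz public model
# here is a stand-in to show the encode/decode round trip.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks per frame

wav, sr = torchaudio.load("speech.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))            # list of (codes, scale)
    codes = torch.cat([c for c, _ in frames], dim=-1)  # [B, K, T] token matrix
    recon = model.decode(frames)                       # tokens back to waveform

print(codes.shape)  # K codebooks, one discrete token per codebook per frame
```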

The next figure illustrates the token rearrangement procedure employed within the modeling framework.

An example of token rearrangement and modeling (source: paper)

The method consists of 2 key steps:

  1. Causal masking hides the parts of the recording to be edited. Spans of the token sequence are replaced with placeholder symbols known as mask tokens, and the masked spans are moved to the end of the sequence. This rearrangement is crucial for the next step, predicting the hidden tokens, because autoregressive models generate predictions sequentially: by the time the model reaches the masked spans at the end, it has already seen all of the unmasked context, both before and after the edit point.
  2. Delayed stacking offsets the tokens of each of the codec's codebooks by one time step per codebook, so that at every step the model predicts codebook k conditioned on codebook k−1 from the previous step. This keeps generation over multiple codebooks autoregressive and efficient (see the sketch after this list).
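
Here is a toy numpy sketch of both steps on a small token matrix, under simplifying assumptions (a single masked span, integer placeholder ids); the paper's implementation generalizes this to multiple spans.

```python
# Toy sketch of the two rearrangement steps on a [K, T] codec-token
# matrix (K codebooks, T frames). Simplified: one masked span and
# integer placeholder ids chosen for illustration.
import numpy as np

K, T = 4, 10                        # 4 codebooks, 10 frames
tokens = np.arange(K * T).reshape(K, T)
MASK, EMPTY = -1, -2                # placeholder token ids

# Step 1: causal masking. Cut out the span to edit, leave a mask token
# in its place, and move the span to the end. The model then predicts
# the span last, after seeing the context on BOTH sides of the edit.
start, end = 3, 6                   # frames to edit
masked_span = tokens[:, start:end]
prefix, suffix = tokens[:, :start], tokens[:, end:]
mask_col = np.full((K, 1), MASK)
rearranged = np.concatenate(
    [prefix, mask_col, suffix, mask_col, masked_span], axis=1)

# Step 2: delayed stacking. Shift codebook k right by k steps, so that
# at each time step the model predicts codebook k conditioned on
# codebook k-1 from the previous step.
Tr = rearranged.shape[1]
delayed = np.full((K, Tr + K - 1), EMPTY)
for k in range(K):
    delayed[k, k:k + Tr] = rearranged[k]

print(rearranged.shape, delayed.shape)  # (4, 12) (4, 15)
```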

Unlike conventional models that only look at the preceding elements, VOICECRAFT also considers what comes after the point of interest. This bi-directional context allows it to make more accurate predictions about what the speech should sound like, resulting in more natural and coherent output.

Take, for instance, the task of inserting a new sentence within a paragraph. The model evaluates both the preceding and following sentences relative to the insertion spot, to make sure the new sentence fits in smoothly with the rest of the speech.

Training

VOICECRAFT is trained for speech editing and text-to-speech generation with high fidelity, even for voices it has never been trained on. It pairs a Transformer-based architecture for token prediction with an EnCodec model for speech tokenization.
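
For intuition, below is a heavily simplified PyTorch sketch of such an NCLM: a decoder-only Transformer that embeds the K codebook tokens of each frame, sums them, and predicts the next frame with one output head per codebook. The tiny sizes and the embedding-sum design are illustrative assumptions; text/phoneme conditioning and the 830M-parameter scale of the real model are omitted.

```python
# Heavily simplified sketch of a neural codec language model in PyTorch.
# Assumptions for illustration: tiny sizes, summed per-codebook frame
# embeddings, one output head per codebook, no text conditioning.
import torch
import torch.nn as nn

class TinyNCLM(nn.Module):
    def __init__(self, vocab=1024, K=4, d=256, layers=4, heads=4):
        super().__init__()
        # +2 reserves ids for the mask and empty placeholder tokens.
        self.embed = nn.ModuleList(nn.Embedding(vocab + 2, d) for _ in range(K))
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, layers)
        self.heads = nn.ModuleList(nn.Linear(d, vocab + 2) for _ in range(K))

    def forward(self, codes):  # codes: [B, K, T] delayed-stacked token ids
        B, K, T = codes.shape
        # One summed embedding vector per frame.
        x = sum(emb(codes[:, k]) for k, emb in enumerate(self.embed))
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.decoder(x, mask=causal)  # left-to-right attention only
        # Predict next-frame tokens for every codebook in parallel.
        return torch.stack([head(h) for head in self.heads], dim=1)

logits = TinyNCLM()(torch.randint(0, 1024, (1, 4, 50)))  # [1, 4, 50, 1026]
```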

  • During training, the model learns to predict and fill in gaps: the parts of the speech to mask are chosen at random.
  • At inference, the masking is targeted: the masked span is derived from the differences between the original and the desired transcripts (see the sketch below).
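
As an illustration of the targeted masking, the sketch below uses Python's difflib to locate the word spans that differ between the two transcripts; mapping those word spans to audio frames (done with forced alignment in practice) is omitted here.

```python
# Sketch: decide which word spans to mask by diffing the original and
# desired transcripts. In practice these word spans are mapped to
# codec-token spans via forced alignment, which is omitted here.
from difflib import SequenceMatcher

original = "the quick brown fox jumps over the lazy dog".split()
desired  = "the quick red fox leaps gracefully over the lazy dog".split()

matcher = SequenceMatcher(a=original, b=desired)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        # Words original[i1:i2] get masked out and re-generated as
        # desired[j1:j2] (a substitution, insertion, or deletion).
        print(op, original[i1:i2], "->", desired[j1:j2])
```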

VOICECRAFT was trained on the GigaSpeech training set, which contains 9k hours of audiobooks, podcasts, and YouTube videos at a 16 kHz sampling rate. Audio files shorter than 2 seconds were excluded from the training data.

The main model of VOICECRAFT has 830M parameters and its training took about 2 weeks on 4 NVIDIA A40 GPUs.

REALEDIT dataset

The team manually constructed a dataset named REALEDIT to evaluate the model’s speech editing capabilities in real-world scenarios. It consists of 310 recordings, each 5 to 12 seconds long, collected from audiobooks, YouTube videos, and podcasts. The dataset covers a wide range of editing tasks, from changing a single word to replacing up to 16 words.

Unlike commonly used speech synthesis datasets focused solely on audiobooks (VCTK, LJSpeech, LibriTTS), REALEDIT incorporates diverse content from various sources. The recordings in REALEDIT feature a wider range of accents, speaking styles, recording conditions, and background sounds compared to existing datasets.

Evaluation

The model was evaluated on a diverse set of datasets, including audiobooks, internet videos, and podcasts.

VOICECRAFT outperforms prior state-of-the-art models, including VALL-E and the commercially available XTTS-v2, in both speech editing and zero-shot TTS tasks.

The human evaluation focused on the naturalness of the edited and synthesized speech. The listeners could hardly tell the difference between VOICECRAFT’s output and the original recordings.

VOICECRAFT’s performance comparison on speech editing (source: paper)

Conclusion

VOICECRAFT is a cutting-edge neural codec language model for speech editing and zero-shot text-to-speech (TTS) on real-world data. Its novel token rearrangement method enables efficient autoregressive generation over neural codec tokens while conditioning on context from both directions.

The model’s limitations include occasional long silences and scratching sounds during generation. The current workaround is to generate multiple samples and select the shorter utterances, but more sophisticated methods are needed. The model also raises new challenges for AI safety, particularly around watermarking and detecting synthesized speech.
