NVIDIA Canary 1B: a speech recognition and translation model

Canary is a new multilingual speech-to-text recognition and translation model from the NVIDIA NeMo team.

With 1 billion parameters, it can transcribe speech in English, Spanish, German, and French, with punctuation and capitalization. It also offers bidirectional translation between English and each of the other three languages.

At the time of writing, Canary tops the Hugging Face Open ASR leaderboard as the best open-source speech recognition model, with an average word error rate (WER) of just 6.67%, well ahead of the other open models listed there.

Try the online demo on Hugging Face. See how to train & integrate the model for advanced use here (code included).
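
If you want to integrate the model yourself, a minimal transcription call with the NeMo toolkit looks roughly like the sketch below; the checkpoint name follows the Hugging Face model card, and the audio file name is a placeholder.

```python
# A minimal inference sketch, assuming the NeMo toolkit is installed
# (e.g. pip install "nemo_toolkit[asr]").
from nemo.collections.asr.models import EncDecMultiTaskModel

# Download the pretrained Canary 1B checkpoint.
canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# English transcription with punctuation and capitalization (the default task).
# The first argument is a list of paths to 16 kHz mono audio files.
transcripts = canary.transcribe(["sample_en.wav"], batch_size=1)
print(transcripts[0])
```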

Model Architecture

Canary employs an encoder-decoder architecture, with a FastConformer encoder and a Transformer decoder as its main components (see the figure below).

NVIDIA Canary model architecture (source: model card)

The model translates speech to text with considerable accuracy and flexibility. The FastConformer encoder extracts features from the audio input; the Transformer decoder is conditioned on a task-specific prompt of special tokens that encodes your instructions (source and target language, task, and whether to apply punctuation and capitalization) and attends to the encoder features to generate the text output.
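
To make the prompt concrete, here is a hedged sketch of the manifest-based interface described on the model card: each JSON line carries the prompt fields (taskname, source_lang, target_lang, pnc). The file path and duration are placeholders, and the exact field names should be checked against the current model card.

```python
# Sketch: ask Canary to translate English speech into German text via an input manifest.
import json

from nemo.collections.asr.models import EncDecMultiTaskModel

canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")

# One JSON entry per audio file; these prompt fields select the task.
entry = {
    "audio_filepath": "sample_en.wav",  # placeholder path to a 16 kHz mono file
    "duration": 10.0,                   # clip length in seconds (placeholder)
    "taskname": "s2t_translation",      # use "asr" for plain transcription
    "source_lang": "en",                # language spoken in the audio
    "target_lang": "de",                # language of the generated text
    "pnc": "yes",                       # punctuation and capitalization on/off
    "answer": "na",                     # unused at inference time
}

with open("translate_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")

translations = canary.transcribe("translate_manifest.json", batch_size=1)
print(translations[0])
```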

Canary comprises a total of 24 encoder layers and 24 decoder layers. It uses a separate SentencePiece tokenizer for each language and merges them into a single aggregate tokenizer, which makes it easy to extend the model to new languages without major changes.
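
The aggregate tokenizer is set up in the training configuration rather than in code. The sketch below shows roughly what that section might look like, based on NeMo's aggregate-tokenizer support; the directory paths are hypothetical, and the field names should be checked against the base config shipped with NeMo.

```python
from omegaconf import OmegaConf

# Rough shape of the tokenizer section: one SentencePiece (BPE) model per language,
# combined into a single aggregate ("agg") tokenizer.
tokenizer_cfg = OmegaConf.create({
    "tokenizer": {
        "type": "agg",
        "langs": {
            "en": {"type": "bpe", "dir": "tokenizers/en"},
            "de": {"type": "bpe", "dir": "tokenizers/de"},
            "es": {"type": "bpe", "dir": "tokenizers/es"},
            "fr": {"type": "bpe", "dir": "tokenizers/fr"},
        },
    }
})
print(OmegaConf.to_yaml(tokenizer_cfg))
```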

Training and evaluation

Canary was trained on 85,000 hours of labeled speech from public and proprietary sources, using the NVIDIA NeMo toolkit on 128 NVIDIA A100 80GB GPUs.

You can replicate the training process using the provided example script and base configuration.
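
For orientation, the sketch below shows roughly what such a NeMo training entry point does. It is not the official script, the config path and name are assumptions, and the actual example in the NeMo repository is the authoritative version.

```python
import pytorch_lightning as pl

from nemo.collections.asr.models import EncDecMultiTaskModel
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager


# Config path/name are assumptions; point them at the base configuration you use.
@hydra_runner(config_path="conf/speech_multitask", config_name="fast-conformer_aed")
def main(cfg):
    # Trainer settings (GPUs, precision, max steps, ...) come from the config.
    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))

    # Build the encoder-decoder multitask model and start training.
    model = EncDecMultiTaskModel(cfg=cfg.model, trainer=trainer)
    trainer.fit(model)


if __name__ == "__main__":
    main()
```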

To assess its speech recognition capabilities, Canary was tested in four languages: English, Spanish, French, and German. Each test used the Mozilla Common Voice (MCV) 16.1 dataset, which contains real-world speech samples. Accuracy was measured with the word error rate (WER) metric: the fraction of words that are substituted, deleted, or inserted relative to the reference transcript. Lower WER means better transcription accuracy.
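
As an illustration of the metric, WER can be computed as a word-level edit distance normalized by the reference length; this is a generic implementation, not the leaderboard's exact scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution and one deletion over six reference words -> WER = 2/6 ≈ 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```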

See how Canary compared to other models in the next figure.

Speech recognition: average WER on MCV 16.1 test sets for English, Spanish, French, and German (source: blog)

Canary was also tested for speech translation in both directions: from English to Spanish, French, and German (left-hand panel), and from those languages to English (right-hand panel). Different datasets were used depending on the translation direction:

  • English to other languages: Fleurs and MExpresso datasets
  • Other languages to English: Fleurs and CoVoST datasets

Translation quality was measured with the BLEU score, where higher values indicate better translations.

Speech translation: (left) average BLEU scores for translating from English. (right) average BLEU scores for translating to English (source: blog)
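
As a small illustration of how BLEU is computed, here is a hedged example using the sacrebleu package; the sentences are made up, and the exact scoring setup behind the figure may differ in detail.

```python
# pip install sacrebleu
import sacrebleu

# Hypotheses from a (hypothetical) model and one set of reference translations.
hypotheses = ["Das Wetter ist heute schön.", "Ich gehe morgen ins Kino."]
references = [["Das Wetter ist heute sehr schön.", "Morgen gehe ich ins Kino."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # higher is better; 100 means a perfect match
```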

Compared to other models of similar size, such as Whisper-large-v3 and SeamlessM4T-Medium-v1, Canary achieves better results in both speech recognition and translation.

Conclusion

Canary is a multilingual speech-to-text recognition and translation model that achieves state-of-the-art results on several benchmarks.

The model is still under development, but it can be used in a variety of applications, such as real-time translation for meetings or video calls, generating subtitles for videos, and improving accessibility for people with hearing impairments.

Try the Canary online demo by following these steps:

  1. Upload an audio file or record with your microphone
  2. Select the language of the audio and the language you want to translate to (the input and output language)
  3. Run the model and get the text output

Learn more:
