Meta’s SeamlessM4T can translate and transcribe speech and text across nearly 100 languages

August 30, 2023

Meta launched SeamlessM4T (Massively Multilingual & Multimodal Machine Translation), a new AI model that can translate and transcribe speech and text in almost 100 languages.

According to Meta, SeamlessM4T is “the first all-in-one multilingual multimodal model that can transcribe and translate languages simultaneously”.

The system performs five tasks. It accepts nearly 100 input languages, while speech output is available in 35 languages plus English:

speech-to-speech translation
speech-to-text translation
text-to-speech translation
text-to-text translation
automatic speech recognition (convert speech into text)
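
The five tasks differ only in input modality, output modality, and whether the language changes. A minimal sketch of that grid follows; the short labels (S2ST, S2TT and so on) and the helper tasks_for are shorthand chosen for this illustration, not part of any released API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    source: str       # input modality: "speech" or "text"
    target: str       # output modality: "speech" or "text"
    translates: bool  # False means the output stays in the input language


# Labels are illustrative shorthand for the five tasks listed above.
TASKS = {
    "S2ST": Task("speech", "speech", True),   # speech-to-speech translation
    "S2TT": Task("speech", "text", True),     # speech-to-text translation
    "T2ST": Task("text", "speech", True),     # text-to-speech translation
    "T2TT": Task("text", "text", True),       # text-to-text translation
    "ASR": Task("speech", "text", False),     # automatic speech recognition
}


def tasks_for(source: str, target: str) -> list:
    """Return the labels of all tasks with the given input/output modalities."""
    return sorted(
        name for name, task in TASKS.items()
        if task.source == source and task.target == target
    )
```

For example, tasks_for("speech", "text") returns both ASR and S2TT: the two tasks share modalities and differ only in whether the text comes out in a different language.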

All-in-one system that performs multiple tasks across speech and text. Source

The model is open source and available for use in the code repository.

There are two ways to translate speech from one language into another:

direct way: a single model maps the source speech straight to the translation, without first writing the speech down as text
cascaded way: the speech is first transcribed into text, and that text is then translated into another language
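
The difference between the two approaches comes down to how many models are chained. The sketch below shows both for speech-to-text translation. Every function in it is a hypothetical stand-in: the toy models treat the "audio" bytes as if they were the spoken words, which real systems obviously cannot do.

```python
from typing import Callable


def cascaded_s2tt(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    translate_text: Callable[[str], str],
) -> str:
    """Cascaded: transcribe first, then translate the transcript."""
    transcript = transcribe(audio)      # step 1: speech -> source-language text
    return translate_text(transcript)   # step 2: source text -> target text


def direct_s2tt(audio: bytes, translate_speech: Callable[[bytes], str]) -> str:
    """Direct: one model call, no intermediate transcript."""
    return translate_speech(audio)


# --- Toy stand-ins, for illustration only ---
TOY_LEXICON = {"hola": "hello", "mundo": "world"}


def toy_transcribe(audio: bytes) -> str:
    # Pretend the bytes already contain the spoken words.
    return audio.decode("utf-8")


def toy_translate_text(text: str) -> str:
    # Word-by-word lookup; unknown words pass through unchanged.
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


def toy_translate_speech(audio: bytes) -> str:
    # A single lookup from "audio" to translated text.
    return {b"hola mundo": "hello world"}.get(audio, "")
```

In the cascaded version, any mistake made by transcribe is passed on to translate_text, which is one common argument for the direct design.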

SeamlessM4T uses the direct method, but it is stronger than earlier direct systems because it combines two powerful pre-trained models: a speech representation learning model (w2v-BERT 2.0) and a massively multilingual text-to-text translation model (SeamlessM4T-NLLB).

Overview of SeamlessM4T

The figure below shows the main parts of the SeamlessM4T model: (1) the pre-trained models and (2) the multitasking UNITY.

(1) The pre-trained models (used when finetuning multitasking UNITY):

SeamlessM4T-NLLB: a massively multilingual text-to-text translation model that is trained on a large corpus of parallel text in many different languages.
w2v-BERT 2.0: a speech representation learning model that is trained on a large corpus of unlabeled speech audio data. This model learns to represent the acoustic features of speech in a way that is informative for translation.
T2U: a Text-to-Unit sequence-to-sequence model that is trained to translate text into a sequence of acoustic units.
Multilingual HiFi-GAN unit vocoder: a model that can synthesize speech from a sequence of acoustic units.

(2) The multitasking UNITY handles different tasks involving speech and text, such as speech-to-speech translation, speech-to-text translation, text-to-speech translation, and text-to-text translation.

It contains these modules:

two encoders, one for text and one for speech
a text decoder
a T2U encoder-decoder (text-to-unit, which converts text into a sequence of acoustic units)
the supporting vocoders for synthesizing output speech in S2ST (speech-to-speech translation)
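
The way these modules fit together can be sketched as a simple routing function. This is an illustration under assumptions, not the released implementation: the class name UnitySketch, the method run, and all callable signatures are hypothetical, and the real model may pass internal representations between modules rather than plain text.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class UnitySketch:
    """Hypothetical wiring of the modules listed above."""
    speech_encoder: Callable[[Any], Any]        # audio -> encoded representation
    text_encoder: Callable[[str], Any]          # text -> encoded representation
    text_decoder: Callable[[Any, str], str]     # (encoded, target language) -> text
    t2u: Callable[[str, str], List[int]]        # (text, target language) -> acoustic units
    vocoder: Callable[[List[int], str], Any]    # (units, target language) -> waveform

    def run(self, source: Any, source_modality: str,
            target_modality: str, target_lang: str) -> Any:
        if target_modality not in ("speech", "text"):
            raise ValueError(f"unknown target modality: {target_modality}")

        # 1. Pick the encoder that matches the input modality.
        if source_modality == "speech":
            encoded = self.speech_encoder(source)
        elif source_modality == "text":
            encoded = self.text_encoder(source)
        else:
            raise ValueError(f"unknown source modality: {source_modality}")

        # 2. The shared text decoder produces text in the target language.
        text = self.text_decoder(encoded, target_lang)
        if target_modality == "text":
            return text

        # 3. For speech output, convert text to units, then units to audio.
        units = self.t2u(text, target_lang)
        return self.vocoder(units, target_lang)
```

The point of the sketch is that text output stops after the text decoder, while speech output reuses the same path and adds the T2U and vocoder steps at the end.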

The model is fine-tuned in three stages, which allows it to achieve better accuracy than models that are only fine-tuned in one stage.

Evaluation

SeamlessM4T’s performance was measured across all languages using both automatic metrics (ASR-BLEU, BLASER 2.0) and human evaluation. It was also tested for robustness, bias, and toxicity.
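
ASR-BLEU scores speech output indirectly: the generated audio is transcribed by a speech recognizer, and the transcripts are compared with reference translations using BLEU. The sketch below illustrates the idea with a simplified, unsmoothed BLEU written from scratch; it is not the scoring code used in the evaluation, whitespace tokenization is an assumption, and the transcribe argument is a hypothetical stand-in for a real ASR model.

```python
import math
from collections import Counter
from typing import Callable, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Simplified corpus-level BLEU (one reference each, no smoothing), 0 to 100."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.lower().split(), ref.lower().split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


def asr_bleu(generated_audio: list, references: List[str],
             transcribe: Callable[[object], str]) -> float:
    """Transcribe generated speech, then score the transcripts with BLEU."""
    transcripts = [transcribe(audio) for audio in generated_audio]
    return corpus_bleu(transcripts, references)
```

Because the score depends on the recognizer, ASR errors lower ASR-BLEU even when the generated speech itself is correct, which is one reason the evaluation also includes human judgment.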

The model showed a significant improvement over previous state-of-the-art models (see the figure below).

Translation quality measured on SeamlessM4T and state-of-the-art competitor models. Source

Conclusion

According to Meta, SeamlessM4T is the first model to handle this range of tasks and modalities with a single system. It accepts nearly 100 input languages and can speak 35 output languages (plus English), enabling people to communicate with each other through speech or text in different languages.

The research team aims to make the model accessible to everyone, so it has released two versions in different sizes: SeamlessM4T-Large and SeamlessM4T-Medium (2.3B and 1.2B parameters, respectively).

The model has some limitations to be addressed in the future, such as data quality, data scarcity, domain adaptation, and model size.