LLaMA-Omni is an open-source AI tool designed for real-time voice interaction with large language models. It takes speech inputs and generates both text and spoken responses almost instantly.
The code is available under the Apache-2.0 License in the project's GitHub repository. Since the model is built on Llama-3.1-8B-Instruct, it must also comply with the Llama 3.1 License, which imposes specific conditions such as including "Llama" at the beginning of the name of any derivative AI models.
Speech-based interaction is more conversational and intuitive than text, which is why proprietary models such as GPT-4o now support real-time interaction through speech.
However, open-source large language models (LLMs) still largely lack comparable speech capabilities. LLaMA-Omni addresses this gap by providing low-latency, high-quality speech interaction: it simultaneously generates text and speech responses directly from the user's spoken instructions.
Key advancements
- High-quality responses: It surpasses previous speech-language models in both content and style, thanks to the Llama-3.1-8B-Instruct backbone and a specially constructed dataset, InstructS2S-200K.
- No speech transcription needed: It listens to what you say and responds immediately in both written and spoken form. The model achieves this by integrating five key components: a pre-trained speech encoder, a speech adaptor, the LLM, a speech decoder, and a vocoder.
- Low-latency speech interaction: The response time is as low as 226 ms.
- Simultaneous generation of both text and speech responses from spoken instructions.
- Training efficiency: It was trained in less than 3 days using only 4 NVIDIA L40 GPUs.
The model
LLaMA-Omni’s architecture is shown in the figure below.
The following components enable the system to process spoken inputs and generate coherent text and speech outputs; a minimal code sketch of the data flow follows the list:
- Pretrained speech encoder: Converts the user’s speech instructions into a format that the LLM can process. The encoder’s parameters remain frozen throughout both training stages.
- Speech adaptor: Aligns the speech inputs with the LLM’s text-based architecture. It is specifically designed to bridge the gap between the speech encoder and the LLM.
- LLM: Serves as the core component. It processes the speech input and generates appropriate responses in text form (tokens) and hidden state vectors.
- Speech decoder: Takes the hidden state vectors and transforms them into abstract speech representations (speech units).
- Vocoder: Converts these abstract speech representations into actual, human-sounding speech.
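To make the data flow concrete, here is a minimal, hedged sketch of how the five components could fit together. All module internals, dimensions, and names (`SpeechAdaptor`, `ToyPipeline`, the downsampling factor `k`) are illustrative stand-ins, not the actual LLaMA-Omni implementation:

```python
# Toy sketch of the LLaMA-Omni data flow; every module here is a small stand-in.
import torch
import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Maps speech-encoder features into the LLM embedding space."""
    def __init__(self, enc_dim=1280, llm_dim=512, k=5):
        super().__init__()
        self.k = k                                  # downsampling factor (assumed value)
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * k, llm_dim), nn.ReLU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, feats):                       # feats: (B, T, enc_dim)
        B, T, D = feats.shape
        T = T - T % self.k                          # drop the ragged tail
        feats = feats[:, :T].reshape(B, T // self.k, D * self.k)
        return self.proj(feats)                     # (B, T // k, llm_dim)

class ToyPipeline(nn.Module):
    """End-to-end flow: encoder -> adaptor -> LLM -> speech decoder -> vocoder."""
    def __init__(self, enc_dim=1280, llm_dim=512, vocab=32000, units=1000):
        super().__init__()
        self.encoder = nn.GRU(80, enc_dim, batch_first=True)       # stand-in for the speech encoder
        self.adaptor = SpeechAdaptor(enc_dim, llm_dim)
        self.llm = nn.TransformerEncoder(                          # stand-in for Llama-3.1-8B-Instruct
            nn.TransformerEncoderLayer(llm_dim, 8, batch_first=True), num_layers=2)
        self.text_head = nn.Linear(llm_dim, vocab)                  # text tokens
        self.speech_decoder = nn.TransformerEncoder(                # non-autoregressive decoder
            nn.TransformerEncoderLayer(llm_dim, 8, batch_first=True), num_layers=2)
        self.unit_head = nn.Linear(llm_dim, units)                  # discrete speech units
        self.vocoder = nn.Linear(units, 320)                        # stand-in for the unit-based vocoder

    def forward(self, mel):                                         # mel: (B, T, 80) log-mel frames
        feats, _ = self.encoder(mel)                                 # speech encoder (frozen in training)
        hidden = self.llm(self.adaptor(feats))                       # LLM hidden states
        text_logits = self.text_head(hidden)                         # text response (tokens)
        units = self.unit_head(self.speech_decoder(hidden))          # speech-unit logits
        wave = self.vocoder(units.softmax(-1))                       # waveform chunks
        return text_logits, wave

text_logits, wave = ToyPipeline()(torch.randn(1, 50, 80))            # smoke test with random audio features
```

The concatenate-then-project downsampling in the adaptor is one common way to shorten speech feature sequences before an LLM; the real adaptor may differ in its details.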
Training
As shown in the next figure, the model follows a two-stage training process (a sketch of the corresponding parameter-freezing schedule follows the list):
- Train the model to generate text responses from speech commands. During this step, the speech encoder is kept frozen, while the speech adaptor and the LLM are trained together. The speech decoder is not used at this stage.
- Train the model to generate speech responses. The speech encoder, speech adaptor, and LLM are frozen, and only the speech decoder is trained. This method allows the model to learn both text and speech responses effectively.
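Assuming the toy modules from the architecture sketch above, the freezing schedule could look roughly like this; the trainable heads and stage boundaries are assumptions for illustration:

```python
# Sketch of the two-stage freezing schedule, reusing ToyPipeline from the sketch above.
model = ToyPipeline()

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

set_trainable(model.encoder, False)          # speech encoder stays frozen in both stages
set_trainable(model.vocoder, False)          # the vocoder is pretrained and never updated here

# Stage 1: speech instruction -> text response.
set_trainable(model.adaptor, True)           # adaptor and LLM are trained together
set_trainable(model.llm, True)
set_trainable(model.text_head, True)
set_trainable(model.speech_decoder, False)   # speech decoder is not used yet
set_trainable(model.unit_head, False)
stage1_params = [p for p in model.parameters() if p.requires_grad]

# Stage 2: LLM hidden states -> discrete speech units.
set_trainable(model.adaptor, False)          # everything upstream is now frozen
set_trainable(model.llm, False)
set_trainable(model.text_head, False)
set_trainable(model.speech_decoder, True)    # only the speech decoder is trained
set_trainable(model.unit_head, True)
stage2_params = [p for p in model.parameters() if p.requires_grad]
```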
Data collection and preprocessing
The training relies on the InstructS2S-200K dataset, which contains 200K speech instructions paired with corresponding text and speech responses. These pairs are carefully curated to ensure high-quality interactions that mimic real-world conversational scenarios.
Before training, the speech data is converted into a suitable form by the Whisper-large-v3 speech encoder, which turns raw audio signals into meaningful vector representations that are fed into the training pipeline. A hedged example of this feature-extraction step is shown below.
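For illustration, here is a minimal sketch of extracting Whisper-large-v3 encoder features with Hugging Face transformers. It shows the general preprocessing idea only and is not the project's actual data pipeline:

```python
# Minimal sketch: raw audio -> Whisper-large-v3 encoder features.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder
encoder.eval()                                   # used as a frozen feature extractor

audio = torch.randn(16000 * 5)                   # stand-in for 5 s of 16 kHz speech
inputs = extractor(audio.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    feats = encoder(inputs.input_features).last_hidden_state
print(feats.shape)                               # (1, 1500, 1280) for the padded 30 s window
```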
Inference
- During inference, the LLM generates the text response step by step from the voice instruction. While generating the text, the LLM also produces a hidden state for each token: a vector representation that captures the meaning of the word or token being generated. These hidden states are passed to a speech decoder that generates a sequence of discrete speech units.
- The speech decoder is a non-autoregressive streaming Transformer. It takes the LLM output hidden states and predicts the sequence of discrete units corresponding to the speech response. Once the number of generated units reaches a predefined chunk size (Ω), this segment is sent to the vocoder.
- The vocoder takes these discrete units and creates a speech segment, which is then immediately played to the user. This enables users to start listening to the speech response without waiting for the entire text response to be generated.
This approach allows the system to generate both text and speech responses to the user with minimal delay and high quality. The text is displayed on the screen, while the speech is played as audio.
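The chunked streaming logic can be sketched as follows, again reusing the toy modules from the architecture sketch. The chunk size Ω, the play callback, and the one-unit-per-token simplification are assumptions for illustration:

```python
# Hedged sketch of chunked streaming: text is yielded token by token, and
# speech units are flushed to the vocoder in chunks of OMEGA units.
import torch

OMEGA = 40  # predefined chunk size Ω (illustrative value, not the model's actual setting)

def stream_response(model, mel, play):
    """Yield text logits step by step while flushing speech-unit chunks to the vocoder."""
    feats, _ = model.encoder(mel)
    hidden = model.llm(model.adaptor(feats))          # one hidden state per generated token
    pending = []
    for h in hidden.unbind(dim=1):                    # simulate step-by-step generation
        yield model.text_head(h)                      # text shown on screen incrementally
        # In this toy each hidden state yields one unit; the actual decoder can
        # produce several units per token.
        units = model.unit_head(model.speech_decoder(h.unsqueeze(1)))
        pending.append(units)
        if sum(u.shape[1] for u in pending) >= OMEGA:  # Ω units ready -> synthesize a chunk
            play(model.vocoder(torch.cat(pending, dim=1).softmax(-1)))
            pending.clear()
    if pending:                                        # flush whatever is left at the end
        play(model.vocoder(torch.cat(pending, dim=1).softmax(-1)))

for _ in stream_response(ToyPipeline(), torch.randn(1, 50, 80), play=lambda wav: None):
    pass
```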
Evaluation results
The performance of LLaMA-Omni has been rigorously evaluated, demonstrating significant improvements over previous models. LLaMA-Omni is capable of generating high-quality text and speech responses simultaneously, with a latency as low as 226 milliseconds.
The following table reports average generation time for the speech-to-text instruction-following (S2TIF) and speech-to-speech instruction-following (S2SIF) tasks. LLaMA-Omni's average S2SIF generation time is 1.92 seconds, roughly 13 times faster than SpeechGPT (25.60 / 1.92 ≈ 13).
| Model | S2TIF (s) | S2SIF (s) |
|---|---|---|
| SpeechGPT | 4.28 | 25.60 |
| SALMONN | 4.78 | / |
| Qwen2-Audio | 8.42 | / |
| LLaMA-Omni | 1.49 | 1.92 |
Furthermore, in comparison to earlier speech-language models such as SpeechGPT, LLaMA-Omni requires less training data and computational resources.
Using LLaMA-Omni
Install the dependencies, then download the Llama-3.1-8B-Omni model from Hugging Face, along with the Whisper-large-v3 model and the unit-based HiFi-GAN vocoder. Start the controller and launch the Gradio web server, and you can run the model. For the detailed commands, check the model’s repository.
Conclusion
LLaMA-Omni makes interaction with LLMs more natural and efficient. Its key components include a pretrained speech encoder, a specialized speech adaptor, a streaming speech decoder, and a vocoder, which together enable the model to understand spoken instructions and generate human-like speech.
Read more:
- Paper on arXiv: “LLaMA-Omni: Seamless Speech Interaction with Large Language Models”
- GitHub repository