Meta’s Llama 4, advanced multimodal models with long context

Meta has released Llama 4, a new suite of AI models built on a mixture-of-experts architecture, offering advanced capabilities such as multimodal processing and extended context windows.

Llama 4 is the engine behind the “Meta AI” assistant integrated into platforms like WhatsApp, Messenger, and Instagram.

Llama 4 suite of models. Source: Meta

Contents

  1. Llama 4 model family
  2. Key points
  3. How to use it
  4. Not available in the European Union
  5. Minimum hardware setup required
  6. Training
  7. Evaluation
  8. What users are saying
  9. Conclusion
  10. Links

Llama 4 model family

Released on April 5, 2025, the Llama 4 family consists of three distinct models.

  1. Llama 4 Scout is the smallest model, optimized to run efficiently on a single Nvidia H100 GPU. It has a total of 109B parameters, with 17B active parameters distributed across 16 experts. This design makes it well-suited for long-context tasks such as multi-document summarization and code analysis.
  2. Llama 4 Maverick has a total of 400B parameters, with 17B active parameters distributed across 128 experts. It is designed for multimodal tasks, such as image understanding and text-vision reasoning.
  3. Llama 4 Behemoth, still in training, has 2T total parameters, with 288B active parameters distributed across 16 experts. This model is expected to outperform GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on STEM (science, technology, engineering, and mathematics) benchmarks.

According to Meta, Llama 4 is its most advanced suite of models to date, placing it in direct competition with industry leaders such as the GPT-4, Gemini, and DeepSeek series.

Key points

Llama 4 introduces several innovations that set it apart from its predecessors (Llama 3, Llama 3.1) and competitors:

  • Extended context windows: Scout has a 10M-token context window, allowing for the seamless processing of extensive datasets such as entire books or hours of video transcripts, while Maverick offers a 1M-token context window, ideal for complex reasoning tasks.
  • Multimodal capabilities: For the first time in the Llama series, Llama 4 introduces native multimodal support, enabling it to process text, images, and video inputs. This feature brings it closer to competitors like Gemini 2.5.
  • Mixture-of-Experts (MoE) architecture: This approach activates only the parts of the model needed to process a given token, improving computational efficiency and scalability (a minimal routing sketch follows this list).
  • Parameter efficiency: While both Scout and Maverick use 17B active parameters, they differ significantly in their expert architecture. Scout employs 16 experts to manage these parameters, whereas Maverick uses a much larger set of 128 experts, suggesting different strategies for efficient parameter allocation.
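To make the routing idea concrete, here is a minimal top-1 routing layer in PyTorch. It is an illustrative sketch, not Llama 4's actual implementation: the router scores the experts for each token, and only the selected expert's weights are used, so most of the layer's parameters stay inactive for any given token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Toy mixture-of-experts layer: a router picks one expert per token,
    so only a fraction of the layer's parameters is active for any input."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)       # routing probabilities
        top_w, top_idx = weights.max(dim=-1)              # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                           # tokens routed to expert e
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = SimpleMoELayer(d_model=64, d_hidden=256, num_experts=16)
tokens = torch.randn(8, 64)        # 8 token embeddings
print(layer(tokens).shape)         # torch.Size([8, 64])
```

Llama 4's production layers differ in detail (Meta describes alternating dense and MoE layers, with Maverick sending each token to a shared expert plus one of 128 routed experts), but the selective-activation principle is the same.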

How to use it

You can download the open-weight Llama 4 models, Scout and Maverick, from llama.com and Hugging Face (a minimal loading sketch follows the integration options below). As mentioned above, Meta AI, built on Llama 4, is integrated directly into WhatsApp, Messenger, Instagram Direct, and the web interface (not available in the European Union).

Integration options:

  • APIs: Standard APIs are available for easy deployment in applications such as chatbots and coding assistants.
  • Fine-tuning: Developers can use PyTorch-based tools to customize Llama 4 models for specific use cases. For instance, the LlamaIndex library provides resources for fine-tuning models like Llama 2, which can be adapted for Llama 4.
  • Enterprise applications: The LlamaIndex framework supports retrieval-augmented generation (RAG) pipelines for advanced workflows in enterprise settings.
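As a concrete starting point, the sketch below loads a Llama 4 checkpoint with the Hugging Face Transformers pipeline API. The model id shown is the Scout instruct checkpoint listed on Hugging Face; you must accept Meta's license on the model page first, and the exact loading path (for example, a processor-based multimodal workflow) may differ depending on your Transformers version, so treat this as an assumption-laden sketch rather than an official recipe.

```python
# Minimal sketch: run a Llama 4 checkpoint via Hugging Face Transformers.
# Assumes access to the gated repo has been granted and that enough GPU memory
# is available (see the hardware section below).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model id
    device_map="auto",           # shard weights across available GPUs
    torch_dtype=torch.bfloat16,  # half precision to reduce memory pressure
)

prompt = "Explain the mixture-of-experts architecture in two sentences."
result = generator(prompt, max_new_tokens=120)
print(result[0]["generated_text"])
```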

However, Llama 4’s new license comes with several limitations. Companies with more than 700 million monthly active users must request a special license from Meta. You must prominently display “Built with Llama” on related websites, interfaces, and documentation; any AI model created using Llama Materials must include “Llama” at the beginning of its name; and you must include the specified attribution notice in a “Notice” text file with any distribution.

Not available in the European Union

Firms and individuals based in the European Union are restricted from installing or training the Llama 4 models because of the requirements of the EU AI Act. However, EU end users can still access services powered by Llama 4 if those services are based outside the EU.

Minimum hardware setup required

The following table provides a general guideline for the minimum setup required to run Llama 4 Scout and Maverick in 4-bit precision. Adjustments may be necessary based on specific use cases and available resources.

| Model | Minimum GPU | Minimum RAM | Minimum storage | Notes |
| --- | --- | --- | --- | --- |
| Llama 4 Scout | 1× H100 (80 GB VRAM) or better; 1× H200 (96 GB VRAM); or 2× A100 (80 GB VRAM each) | At least 64 GB | 60-70 GB | Running with the 10M-token context requires 256 GB RAM. |
| Llama 4 Maverick | At least 4× H100 (320 GB VRAM total); 3× H200 (288 GB VRAM); or at least 5× A100 (400 GB VRAM) | At least 128 GB | 210-250 GB | The 1M-token context requires a multi-GPU setup and at least 256 GB RAM. |

Hardware requirements (our estimations)

These setups assume optimized software (e.g., Transformers, vLLM) and small context lengths for practical inference. Scout is more accessible, while Maverick demands more resources.
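As an illustration of the 4-bit setup assumed in the table, the following sketch loads a checkpoint with on-the-fly NF4 quantization through bitsandbytes. The model id, the Llama4ForConditionalGeneration class, and the resulting memory footprint are assumptions about the current Transformers integration; serving stacks such as vLLM or pre-quantized checkpoints are common alternatives.

```python
# Minimal sketch: 4-bit (NF4) loading with bitsandbytes, matching the table's assumptions.
# Assumes a recent Transformers release with Llama 4 support and bitsandbytes installed.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit blocks
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # assumed model id
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # spread layers over the available GPUs
)

inputs = processor(text="List three long-context use cases.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```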

Training

Pre-training (see the next picture): Llama 4 is Meta’s first AI model series to incorporate the MoE architecture. By activating only a subset of parameters per token, MoE improves both training and inference efficiency.

These models are natively multimodal, using early fusion – a technique in multimodal AI where text, images, and video are combined at the beginning of the model’s processing. This allows the model to learn the relationships between different data types from the start, leading to more accurate and natural interactions across modalities.
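To illustrate what early fusion means in practice (an illustrative sketch, not Meta's implementation), the snippet below projects image patch features and text token embeddings into the same space and concatenates them into one sequence before any transformer layer runs, so the backbone attends across modalities from the very first layer.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Toy early-fusion front end: image patches and text tokens are merged
    into one token sequence, so a single backbone learns cross-modal relations."""

    def __init__(self, vocab_size=32000, patch_dim=768, d_model=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)  # text tokens -> embeddings
        self.image_proj = nn.Linear(patch_dim, d_model)      # patch features -> same space
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )

    def forward(self, text_ids, image_patches):
        text_tokens = self.text_embed(text_ids)               # (batch, n_text, d_model)
        image_tokens = self.image_proj(image_patches)         # (batch, n_patches, d_model)
        fused = torch.cat([image_tokens, text_tokens], dim=1) # one joint sequence
        return self.backbone(fused)

model = EarlyFusionEncoder()
text_ids = torch.randint(0, 32000, (1, 16))   # 16 text tokens
image_patches = torch.randn(1, 64, 768)       # 64 image patch features
print(model(text_ids, image_patches).shape)   # torch.Size([1, 80, 512])
```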

The training dataset consists of 30T tokens, more than twice the size of Llama 3’s dataset. Mid-training optimizations further refine performance, enabling Llama 4 Scout to handle input contexts of up to 10M tokens.

Llama 4 pre-training by alternating dense and mixture-of-experts (MoE) layers for inference efficiency. Source: Meta

Post-training: To balance multimodal inputs, reasoning, and conversation, Llama 4’s post-training process uses a refined pipeline: lightweight supervised fine-tuning (SFT) → online reinforcement learning (RL) → lightweight direct preference optimization (DPO). The process focused on medium-to-hard prompts to achieve higher accuracy in reasoning, coding, and math. A continuous online RL strategy further improved efficiency and intelligence.
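The SFT and online RL stages rely on Meta's internal data and infrastructure, but the final DPO step has a standard, compact objective. The sketch below implements the generic DPO loss on per-sequence log-probabilities; it illustrates the general formulation, not Meta's "lightweight" variant or its hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard direct preference optimization loss.

    Each argument is the summed log-probability that the policy or the frozen
    reference model assigns to the chosen / rejected response of a preference pair.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)      # implicit reward, preferred answer
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the policy to rank the chosen response above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities for a batch of 3 preference pairs.
loss = dpo_loss(torch.tensor([-10., -12., -9.]), torch.tensor([-14., -13., -15.]),
                torch.tensor([-11., -12., -10.]), torch.tensor([-13., -13., -14.]))
print(loss)
```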

Evaluation

Meta has conducted extensive benchmarking to compare Llama 4 with other leading AI models. The next figures illustrate some of the evaluation results.

Llama 4 Maverick outperforms models like GPT-4o and Gemini 2.0 Flash and achieves results comparable to DeepSeek V3 on reasoning and coding benchmarks. An experimental chat version of Maverick reached a strong Elo score of 1417 on LMArena, showcasing its performance-to-cost ratio. However, that score was achieved with a version optimized specifically for conversational performance, which differs from the publicly released model, and the discrepancy raised concerns about benchmark integrity and transparency.

Llama 4 Scout demonstrates superior performance in long-context tasks, outperforming models like Google’s Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across various benchmarks.

Source: Meta

The Needle in a Haystack (NIAH) benchmark in Llama 4 evaluates a model’s capability to extract specific information from large text datasets. In this test, a unique detail (the “needle”) is embedded within a lengthy document (the “haystack”), and the model is prompted to retrieve and recall it. A higher success rate indicates stronger long-context memory and retrieval skills. (See the next image, where blue represents success and white indicates failure.)
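A minimal version of this test is straightforward to construct: hide a fact at a chosen depth inside filler text and check whether the model can quote it back. In the sketch below, model_answer() is a hypothetical stand-in for whatever inference call you use (API, pipeline, etc.).

```python
import random

def build_niah_prompt(needle: str, filler_sentences: list[str], depth: float) -> str:
    """Insert the 'needle' at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    position = int(len(filler_sentences) * depth)
    haystack = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return (" ".join(haystack)
            + "\n\nQuestion: What is the secret passphrase mentioned above? Answer briefly.")

filler = [f"Unrelated sentence number {i} about nothing in particular." for i in range(5000)]
needle = "The secret passphrase is 'violet-citadel-42'."
prompt = build_niah_prompt(needle, filler, depth=random.random())

# model_answer() is a hypothetical stand-in for your inference call (API, pipeline, etc.).
# success = "violet-citadel-42" in model_answer(prompt)
```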

Source: Meta

Llama 4, especially its Scout variant, has been optimized for long-context understanding, handling up to 10M tokens. Scout achieves state-of-the-art performance on NIAH benchmarks with no failure (no white square), demonstrating its ability to accurately extract information from extremely long documents.

What users are saying

  • The models are both giant MoEs that can’t be run on consumer GPUs, even with quant. (Jeremy Howard)
  • They didn’t release new 1/3/8B models like previous times. (Julien Lauret)
  • These frontier models should serve as foundational teacher models for smaller, distilled, quantized, and context-specific models. This aligns perfectly with agentic architectures. DeepSeek has taught us a lesson or two in this regard. (Evandro Reis)

Conclusion

Llama 4 marks a new step in AI development, combining efficiency, scalability, and multimodal intelligence. Its open-weight release and innovative features position it as a versatile tool for researchers, developers, and enterprises.

Those interested in exploring Meta’s upcoming research, prototypes, and product vision are invited to register for its LlamaCon conference on April 29.

Meta’s press release: “The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation”
