BAGEL from ByteDance, an open-source multimodal AI

Have you ever wondered how AI can generate detailed captions from images, answer questions about videos, or understand both text and visuals at the same time? That is exactly what this research enables: BAGEL (paper, code, project page, demo), an open-source multimodal model that unifies understanding and generation across different media types.

BAGEL is a unified multimodal foundation model developed by ByteDance, designed for both understanding and generating content across various data forms, including text, images, and videos. It supports tasks such as text-to-image generation, advanced image editing (including free-form visual manipulation and multiview, 3D-aware synthesis), and multimodal reasoning, such as answering visual questions and performing step-by-step analysis. The model can also create or navigate virtual environments from multiple viewpoints.

ByteDance, the Chinese company behind BAGEL, is best known as the owner of the short-video and e-commerce platforms TikTok (global) and Douyin (its Chinese counterpart).

Showcase of the versatile abilities of the BAGEL model (source: paper)

The model is available under the Apache 2.0 license, making it freely accessible for research and development. You can explore BAGEL through its GitHub repository, project page, or Hugging Face model page.

BAGEL, the open-source unified multimodal model (source: project page)

Contents

  1. Key capabilities
  2. Model
  3. Training
  4. Evaluation
  5. BAGEL’s new abilities
  6. Text-to-image qualitative comparison
  7. Editing and manipulation task comparison
  8. How to use it
  9. Conclusion
  10. References

Key capabilities

  • Multimodal Chain-of-Thought reasoning – Handles prompts that combine text with visuals and reasons step-by-step to produce coherent outputs.
  • Image understanding and editing – Analyzes images with contextual understanding and applies edits that maintain visual coherence.
  • Text-to-Image generation – Generates photorealistic images from text prompts, with quality comparable to models like Stable Diffusion 3 Medium.
  • Multiview synthesis and 3D reasoning – Given a single image, BAGEL creates plausible alternative views and understands spatial layouts.
  • World navigation and dynamics modeling – Learns from video data to model motion, 3D perspectives, and navigation tasks.

These capabilities emerged naturally during training, without explicit programming for each task.

Model

BAGEL is a multimodal model built on a decoder-only Transformer architecture, initialized from Qwen2.5, which is a powerful and publicly available LLM. It adopts a Mixture-of-Transformer-Experts (MoT) design, using two specialized Transformer Experts to process understanding and generation (see Und. Expert and Gen. Expert in the next figure). For image processing, BAGEL uses two distinct visual encoders: one dedicated to image understanding and the other optimized for image generation.

Within each Transformer block, all tokens — whether textual or visual — participate in a shared multimodal self-attention mechanism, ensuring smooth integration of language and visual information.

BAGEL uses a Mixture-of-Transformer-Experts (MoT) architecture with two specialized experts (source: paper)
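
To make this concrete, below is a minimal sketch of an MoT-style block in PyTorch. It is not BAGEL's actual implementation: the names (`und`, `gen`, `is_gen`) and the routing-by-flag mechanism are illustrative assumptions. The point it demonstrates is that the two experts keep separate weights, yet all tokens meet in a single shared self-attention pass over the whole multimodal sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlockSketch(nn.Module):
    """Simplified Mixture-of-Transformer-Experts block (illustrative only).

    Two experts ("und" for understanding, "gen" for generation) hold separate
    projection and FFN weights, but every token participates in ONE shared
    self-attention over the full multimodal sequence. Norms, output projections,
    and efficient routing are omitted for brevity.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.ModuleDict({
            "und": nn.Linear(dim, 3 * dim),
            "gen": nn.Linear(dim, 3 * dim),
        })
        self.ffn = nn.ModuleDict({
            "und": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "gen": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, x: torch.Tensor, is_gen: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) mixed text/vision tokens; is_gen: (B, T) bool flags
        # marking which tokens belong to the generation expert.
        B, T, D = x.shape
        route = is_gen.unsqueeze(-1)

        # Per-token expert projections (a real implementation would route
        # tokens instead of computing both branches and selecting).
        qkv = torch.where(route, self.qkv["gen"](x), self.qkv["und"](x))
        q, k, v = qkv.chunk(3, dim=-1)

        # One shared attention pass over the whole multimodal sequence.
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + attn.transpose(1, 2).reshape(B, T, D)

        # Per-token expert feed-forward.
        x = x + torch.where(route, self.ffn["gen"](x), self.ffn["und"](x))
        return x
```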

BAGEL combines next-token prediction for text with a diffusion-based approach (e.g., Rectified Flow) for generating visual content. Unlike traditional autoregressive models that predict one token at a time, BAGEL’s training framework adopts a Next Group of Token Prediction paradigm. This allows the model to learn and predict coherent groups of tokens, especially in visual and multimodal contexts, as unified targets. Thus, BAGEL more effectively captures cross-modal dependencies, generates more coherent outputs, and supports long-context reasoning. This approach also reduces the sequential bottlenecks typical of standard next-token prediction in multimodal tasks.
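
The sketch below illustrates how the two objectives can sit in a single training step: cross-entropy for next-token prediction on text and a rectified-flow (velocity-matching) loss on image latents. The shapes, the equal loss weighting, and the function names are simplifying assumptions, not BAGEL's training code.

```python
import torch
import torch.nn.functional as F

def combined_loss_sketch(text_logits, text_labels, velocity_model, clean_latents):
    """Illustrative combined objective: LM cross-entropy + rectified flow.

    text_logits:    (B, T, V) next-token logits produced by the LLM backbone.
    text_labels:    (B, T) target token ids (-100 marks positions to ignore).
    velocity_model: callable (x_t, t) -> predicted velocity, same shape as x_t.
    clean_latents:  (B, C, H, W) image latents (e.g., from a VAE encoder).
    """
    # 1) Autoregressive text loss (label shifting assumed to be done upstream).
    lm_loss = F.cross_entropy(
        text_logits.flatten(0, 1), text_labels.flatten(), ignore_index=-100
    )

    # 2) Rectified-flow loss: interpolate linearly between data and noise and
    #    regress the constant velocity (noise - data) along that straight path.
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], device=clean_latents.device)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * clean_latents + t_ * noise
    flow_loss = F.mse_loss(velocity_model(x_t, t), noise - clean_latents)

    # Equal weighting here is an assumption; the paper's schedule may differ.
    return lm_loss + flow_loss
```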

To further improve performance and stability, BAGEL incorporates advanced techniques from modern LLM and vision architectures, such as RMSNorm, SwiGLU, RoPE, and Grouped Query Attention (GQA) for efficient key-value (KV) caching. Additionally, QK-Norm is applied within each attention block, drawing inspiration from best practices in vision and video generation models.

QK-Norm refers to adding Layer Normalization to both query (Q) and key (K) vectors before calculating their dot-product attention. This step keeps the training process stable, especially when working with very large models. Without QK-Norm, Q and K values can grow uncontrollably during training.
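
In code, QK-Norm is one extra normalization of the query and key tensors right before the attention scores are computed. The snippet below is a minimal sketch using a LayerNorm over the head dimension, matching the description above; BAGEL's exact normalization details follow the paper.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, eps: float = 1e-6):
    """Attention with QK-Norm (illustrative).

    q, k, v: (batch, heads, seq_len, head_dim). Normalizing Q and K bounds the
    magnitude of the attention logits, which keeps training stable at scale.
    """
    q = F.layer_norm(q, q.shape[-1:], eps=eps)  # normalize queries per head
    k = F.layer_norm(k, k.shape[-1:], eps=eps)  # normalize keys per head
    return F.scaled_dot_product_attention(q, k, v)

# Example: 2 sequences, 8 heads, 16 tokens, 64-dim heads.
out = qk_norm_attention(*(torch.randn(2, 8, 16, 64) for _ in range(3)))
```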

Training

BAGEL was trained on trillions of interleaved multimodal tokens spanning language, images, videos, and web content, enabling rich cross-modal interactions. Training proceeds in four distinct stages, each with its own objectives and data (a compact restatement of the schedule follows this list):

  1. Alignment: Focuses on aligning the SigLIP2 ViT encoder with the Qwen2.5 LLM using image-text pairs for image captioning, with images resized to 378 × 378.
  2. Pre-training: Involves training with a corpus of 2.5T tokens, including various data types, while employing a native-resolution strategy for multimodal tasks.
  3. Continued training: Increases visual input resolution and emphasizes cross-modal reasoning. This stage consumes approximately 2.6T tokens.
  4. Supervised fine-tuning: Uses high-quality subsets from datasets for multimodal generation and understanding, totaling 72.7B training tokens.
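
For quick reference, the same schedule can be restated as a small summary structure. The layout below is purely illustrative (it is not a configuration file from the repository); the figures are simply those quoted above.

```python
# Restatement of BAGEL's four training stages (figures from the paper;
# the dictionary layout itself is illustrative, not an official config).
TRAINING_STAGES = [
    {"stage": "alignment",
     "objective": "align SigLIP2 ViT with the Qwen2.5 LLM via image captioning",
     "image_resolution": "378 x 378", "tokens": None},
    {"stage": "pre-training",
     "objective": "interleaved multimodal corpus, native-resolution visual inputs",
     "image_resolution": "native", "tokens": 2.5e12},
    {"stage": "continued training",
     "objective": "higher visual resolution, emphasis on cross-modal reasoning",
     "image_resolution": "increased", "tokens": 2.6e12},
    {"stage": "supervised fine-tuning",
     "objective": "high-quality generation and understanding subsets",
     "image_resolution": None, "tokens": 72.7e9},
]
```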

Evaluation

BAGEL demonstrates superior performance and unique emerging capabilities across a wide range of multimodal tasks.

Multimodal understanding: BAGEL consistently outperforms current top-tier open-source Vision-Language Models (VLMs) like Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards. This includes benchmarks such as MME, MMBench, MMMU, and MM-Vet, indicating its strong ability to comprehend and reason about diverse visual and textual information.

| Model | MME-P (Perception) | MMBench | MMMU | MM-Vet |
| --- | --- | --- | --- | --- |
| Chameleon-7B | – | 35.7 | 28.4 | 8.3 |
| Show-o-1.3B | 1097 | – | 26.7 | – |
| Emu3-8B | 1244 | 58.5 | 31.6 | 37.2 |
| TokenFlow-XL-13B | 1546 | 68.9 | 38.7 | 40.7 |
| Janus-Pro-7B | 1567 | 79.2 | 41 | 50 |
| MetaQuery-XL-7B | 1685 | 83.5 | 58.6 | 66.6 |
| BLIP3-o-8B | 1683 | 83.5 | 50.6 | 66.6 |
| BAGEL | 1687 | 85 | 55.3 | 67.2 |
Multimodal understanding performance across models (source: project page)

Multimodal generation: On object-centric text-to-image evaluation, BAGEL performs strongly across the board, with high scores for single- and two-object generation, counting, and color fidelity, as well as the harder position and color-attribute categories.

| Model | Single Object | Two Object | Counting | Colors | Position | Color Attribute | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chameleon-7B | – | – | – | – | – | – | 0.39 |
| Show-o-1.3B | 0.98 | 0.80 | 0.66 | 0.84 | 0.31 | 0.50 | 0.68 |
| Emu3-8B | 0.99 | 0.81 | 0.42 | 0.80 | 0.49 | 0.45 | 0.66 |
| TokenFlow-XL-13B | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| MetaQuery-XL-7B | – | – | – | – | – | – | 0.80 |
| BLIP3-o-8B | – | – | – | – | – | – | 0.84 |
| BAGEL | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
Visual generation capabilities across models (source: project page)

BAGEL’s new abilities

As BAGEL is trained on more data, it gets better at tasks like understanding images, generating content, and editing visuals. But these skills appear in different stages:

  1. Early stage: The model first develops multimodal understanding and high-fidelity generation, meaning it can recognize and create relevant text or visuals based on the provided data.
  2. Intermediate stage: It gains the ability to perform basic editing, such as adjusting image features or modifying textual descriptions.
  3. Advanced stage: More complex intelligent editing emerges, allowing the model to make refined adjustments with greater reasoning and contextual awareness.

BAGEL's emerging capability curves (source: paper)

This gradual progression suggests that advanced skills naturally emerge as the model builds on strong foundational abilities. The picture below demonstrates that BAGEL can perform chain-of-thought reasoning. This allows it to break down and think through complex multimodal tasks step-by-step before generating an output, leading to more accurate and intelligent results.

BAGEL’s thinking (Chain-of-Thought Reasoning) capability (source: paper)

Additionally, ablation studies reveal that combining VAE and ViT features enables intelligent editing.

Text-to-image qualitative comparison

In addition to English, BAGEL supports Chinese prompts and can generate images at arbitrary aspect ratios. In the qualitative comparison below, it produces higher-quality images than Janus-Pro 7B and outperforms the dedicated text-to-image model SD3-medium. It also avoids their limitations: SD3-medium requires Chinese prompts to be translated into English, GPT-4o relies on text prompts to control aspect ratios, and Janus-Pro is restricted to square images.

T2I qualitative comparison (source: paper)

Editing and manipulation task comparison

The next picture showcases qualitative comparisons of BAGEL’s performance across various image editing scenarios. The comparisons highlight BAGEL’s ability to handle free-form visual manipulation, multiview synthesis, and world navigation, demonstrating its superiority over other open-source models.

Comparison on editing and manipulation tasks (source: paper)

How to use it

ByteDance’s BAGEL is open-source (Apache License 2.0) and there are several ways to use it (for more details, please visit its official GitHub repository):

  • Through online demos: BAGEL is available for free online, where you can interact with it via a web interface. There’s also a Hugging Face Space offering image-oriented tasks.
  • Running locally: Clone the GitHub repository, set up the environment, download the pretrained checkpoint, and run the provided scripts (a minimal checkpoint-download sketch follows this list). GPU requirements: BAGEL is a large model (7B active parameters, 14B total) and needs significant GPU resources, typically a GPU with at least 24 GB of VRAM (such as an NVIDIA L40S) or multiple GPUs for the larger modes. Quantized variants may allow it to run on GPUs with less VRAM (roughly 12-32 GB).
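
As a starting point for the local route, the sketch below shows one way to fetch the released checkpoint with the huggingface_hub library. The repository id and local directory are assumptions to double-check against the official README, which also documents the actual inference and quantization scripts.

```python
# Minimal sketch: download a BAGEL checkpoint from Hugging Face.
# The repo_id and local_dir values are assumptions; confirm them in the
# official GitHub README before use.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",  # assumed Hugging Face model id
    local_dir="models/BAGEL-7B-MoT",        # any local directory works
)
```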

Conclusion

BAGEL represents a significant advancement in open-source multimodal AI, offering a unified model capable of complex reasoning and generation across multiple data types. It adopts a decoder-only Mixture-of-Transformer-Experts (MoT) architecture and is pretrained on trillions of interleaved text, image, video, and web tokens, enabling strong performance in advanced multimodal understanding.

BAGEL not only outperforms existing open-source models in both multimodal understanding and generation, but also demonstrates advanced capabilities such as free-form image editing, world navigation, and sophisticated object-centric reasoning.

References
