Cosmos simulates physical worlds for training AI systems

February 3, 2025

NVIDIA has released the Cosmos World Foundation Model Platform, a toolkit designed for developing physical AI systems.

The Cosmos platform offers pre-trained World Foundation Models (WFMs) for robotics and self-driving cars. It also provides tools and resources for simulating, training, and optimizing physical AI systems within controlled environments. These tools include video tokenizers and a video data curation pipeline (NVIDIA NeMo Curator).

Developers can also fine-tune the Cosmos WFMs or build new models from scratch using the NVIDIA NeMo Framework, a GPU-accelerated platform for training and fine-tuning.

These models and tools are available under NVIDIA’s open model license on Hugging Face and the NVIDIA NGC catalog.

Cosmos World Foundation Model Platform (source)

The team plans to offer Cosmos WFMs as NVIDIA NIM inference microservices, simplifying their integration into diverse workflows.

Deploying generative AI in production with NVIDIA NIM (source)

Motivation: Before being deployed in real-world applications, physical AI systems must undergo rigorous training to acquire a deep understanding of real-world physics and natural behaviors. Training these systems directly in real-world scenarios can be costly, time-intensive, and even dangerous.

To address these challenges, training is conducted in digital environments that provide safe and controlled conditions. However, accurately replicating real-world environments in a digital setting remains a considerable challenge.

Solution: Cosmos offers a set of World Foundation Models (WFMs) that can generate complex environments closely mimicking real-world conditions. In addition to generating data, developers can customize Cosmos WFMs by fine-tuning them for specific use cases. For example, a team designing delivery drones can simulate various weather conditions and terrains in a digital environment.

Key components

The Cosmos platform provides the following key components:

1. Video processing and curation pipeline: A built-in video curation pipeline that simplifies the process of collecting, organizing, and processing video content.

2. Video tokenizers: These convert video data into compact token representations suitable for AI training, either continuous (latent vectors) or discrete (integers), at high compression rates.

3. Pre-trained WFMs: The Cosmos WFMs are generative AI models pre-trained on 9,000 trillion tokens drawn from 20M hours of video data covering autonomous driving, robotics, synthetic environments, and related fields. These models use two architectures: diffusion and autoregressive.

3a. Diffusion models: They are widely used for generating images, videos, and audio because they produce high-quality, realistic outputs. The training data is gradually corrupted with noise and the model learns to remove this noise step-by-step, reconstructing the original data.

The picture below illustrates the Cosmos-1.0-Diffusion WFM.

Overall architecture of Cosmos-1.0-Diffusion World Foundation Model (source: paper)

The system processes an input video by first converting it into tokens using the Cosmos-1.0-Tokenizer-CV8x8x8 encoder. Gaussian noise is then added to the tokens, and the noisy tokens are split into smaller patches through a 3D patchification process.

For specific scenarios, an optional text prompt can be provided. The prompt is encoded with a T5 text encoder and combined with the video tokens. The model then processes this information through multiple layers, including a Multi-Layer Perceptron (MLP), to remove the noise step by step. Finally, the decoder reconstructs the video from the refined tokens, producing a clear output.
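To make the tokenization, noising, and patchification steps concrete, here is a minimal Python sketch of the shape arithmetic involved. It rests on several assumptions that are not stated in this article: that the "8x8x8" in the tokenizer name means 8x compression along time, height, and width; that the tokenizer is causal, so the first frame is encoded on its own; and that each patch spans 1x2x2 latent positions. The function names, the 704x1280 example resolution, and the patch size are illustrative choices, not the Cosmos API.

```python
import random


def latent_shape(frames, height, width, t_factor=8, s_factor=8):
    """Latent grid (time, height, width) for an assumed causal 8x8x8 tokenizer.

    Assumption: the first frame is kept as its own latent frame and the
    remaining frames are compressed in groups of `t_factor`.
    """
    if (frames - 1) % t_factor or height % s_factor or width % s_factor:
        raise ValueError("input size must align with the compression factors")
    return (1 + (frames - 1) // t_factor, height // s_factor, width // s_factor)


def add_gaussian_noise(values, sigma, rng):
    """Corrupt each latent value with zero-mean Gaussian noise of scale sigma."""
    return [v + sigma * rng.gauss(0.0, 1.0) for v in values]


def patch_count(latent, patch=(1, 2, 2)):
    """Number of patches after 3D patchification with a hypothetical patch size."""
    t, h, w = latent
    pt, ph, pw = patch
    if t % pt or h % ph or w % pw:
        raise ValueError("latent grid must be divisible by the patch size")
    return (t // pt) * (h // ph) * (w // pw)


if __name__ == "__main__":
    grid = latent_shape(frames=121, height=704, width=1280)
    print("latent grid:", grid)
    print("patches fed to the model:", patch_count(grid))
    print("noisy sample:", add_gaussian_noise([0.5, -0.2], 0.1, random.Random(0)))
```

Under these assumptions, a 121-frame clip shrinks to a far smaller latent grid before the denoising layers ever see it, which is what makes processing long clips tractable.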

3b. Autoregressive models: Designed for processing sequences, these models generate videos by predicting future frames based on text input and past video frames.

Architecture of Cosmos-1.0-Autoregressive-Video2World Model (source: paper)

The model undergoes progressive pretraining, beginning with predicting up to 17 future frames from a single input frame, then extending to 34 frames, and eventually up to 121 frames (about 50,000 tokens).
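The autoregressive idea can be sketched in a few lines of Python: each new token is predicted from everything generated so far and then appended to the context. This is a toy illustration only. The `generate` function and `toy_predictor` are hypothetical stand-ins, and a real model would predict discrete video tokens with a trained network rather than a fixed rule.

```python
from typing import Callable, List, Sequence


def generate(
    context: Sequence[int],
    predict_next: Callable[[Sequence[int]], int],
    n_new: int,
) -> List[int]:
    """Extend `context` by `n_new` tokens, one prediction at a time."""
    tokens = list(context)
    for _ in range(n_new):
        # Each prediction sees the original context plus all earlier outputs.
        tokens.append(predict_next(tokens))
    return tokens


def toy_predictor(tokens: Sequence[int]) -> int:
    """Hypothetical stand-in for a trained model: count upward modulo 10."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([7, 8], toy_predictor, n_new=4))  # [7, 8, 9, 0, 1, 2]
```

Because every step feeds on earlier outputs, longer horizons are harder to learn, which is one plausible reading of why the pretraining schedule grows from 17 to 34 to 121 frames rather than starting at full length.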

4. WFM post-training samples: The pre-trained WFM is further fine-tuned for specific applications, such as camera control, robotic manipulation, and autonomous driving. Because the pre-trained WFM already understands physics and general behaviors, post-training uses smaller datasets, leading to significant savings in both time and resources.

5. Guardrail system: A pre-Guard blocks harmful inputs and a post-Guard prevents problematic outputs, protecting both users and systems from potentially unsafe or undesirable content generated by AI models.
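The two-stage guardrail described in item 5 can be pictured as a wrapper around the generation call. The sketch below is a simplified illustration, not the Cosmos guardrail implementation: the blocklist, the "UNSAFE" marker, and all function names are hypothetical, and a production system would use trained safety classifiers on prompts and generated video rather than string checks.

```python
from typing import Callable, Optional

# Hypothetical blocklist used only for this illustration.
BLOCKED_TERMS = {"weapon", "gore"}


def pre_guard(prompt: str) -> bool:
    """Return True if the prompt may be passed to the model."""
    words = prompt.lower().split()
    return not any(term in words for term in BLOCKED_TERMS)


def post_guard(output: str) -> bool:
    """Return True if the generated output may be returned to the user."""
    return "UNSAFE" not in output


def guarded_generate(prompt: str, model: Callable[[str], str]) -> Optional[str]:
    """Run the model only for allowed prompts and release only allowed outputs."""
    if not pre_guard(prompt):
        return None  # blocked before generation
    output = model(prompt)
    if not post_guard(output):
        return None  # blocked after generation
    return output


if __name__ == "__main__":
    def stub_model(prompt: str) -> str:
        # Stand-in for a world model; returns a text description instead of video.
        return f"video for: {prompt}"

    print(guarded_generate("a robot arm stacking boxes", stub_model))
    print(guarded_generate("gore scene", stub_model))
```

Checking both sides matters because a harmless prompt can still yield an undesirable output, so the input filter alone would not be enough.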

Training data: The team processed 20M hours of raw video (720p–4K) to generate 100M clips for pre-training and 10M for fine-tuning specific applications.

Evaluation results

NVIDIA uses Cosmos benchmarks to assess how accurately WFMs simulate real-world physics. In addition to the traditional benchmarks for video generation (fidelity, temporal consistency, generation speed), the Cosmos benchmarks introduce two criteria, 3D consistency and physics alignment, to meet the precision requirements of physical AI systems.

Tested on static scenes from a curated subset of 500 videos, Cosmos models outperformed the baseline VideoLDM model, showing superior geometric alignment and higher-quality synthesized views. The models also exhibited strong adherence to physical laws, especially when provided with more conditioning data.

However, issues such as object impermanence and implausible behaviors were observed and need to be addressed in future research.

Conclusion

Developing AI for real-world applications is challenging due to factors like unpredictable environments, real-time decision-making, and the need for extensive training data. The NVIDIA Cosmos platform simplifies this process by providing digital twins of real-world environments and pre-trained World Foundation Models (WFMs). These resources significantly accelerate AI development for robotics, autonomous vehicles, and a wide range of other applications.

Whether you’re an experienced developer or just starting in the field, Cosmos provides an open-source platform to explore, innovate, and put your physical AI projects into practice.