Alibaba launched Wan2.1 (paper, repo, project page), an advanced and open-source video generative model competing with leading open models such as HunyuanVideo and high-performance closed-source systems like OpenAI’s Sora.

Key features
- High performance: Outperforms both open-source and commercial models across a wide range of benchmark dimensions, including text-to-video quality, temporal consistency, text rendering accuracy, and motion realism.
- Optimized for consumer hardware: The T2V-1.3B model runs on just 8.19 GB of VRAM, making it accessible to most consumer-grade GPUs. It can generate a 5-second 480p video on an RTX 4090 in approximately 4 minutes – without requiring optimization techniques.
- Supports multiple input modalities: Wan2.1 can perform Text-to-Video (T2V), Image-to-Video (I2V), Video Editing, Text-to-Image (T2I), and Video-to-Audio (V2A) generation tasks.
- Supports Chinese and English visual text generation: It is the first video model to generate robust visual text in both Chinese and English.
- Introduces a unique Video Variational Autoencoder (VAE): The Wan-VAE module compresses a video's spatio-temporal dimensions by a factor of 4 × 8 × 8 (time × height × width), which makes it well suited for integration with diffusion-based generative models such as DiT. A rough shape calculation is sketched below.
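
To make the 4 × 8 × 8 figure concrete, here is a minimal sketch of the latent-shape arithmetic. The causal handling of the first frame and the latent channel count of 16 are assumptions drawn from the configuration reported for the model, not values read out of the released code.

```python
# Back-of-the-envelope latent-shape arithmetic for Wan-VAE's 4x8x8 compression.
# The first-frame handling and the latent channel count of 16 are assumptions.
def latent_shape(num_frames: int, height: int, width: int, latent_channels: int = 16):
    t = 1 + (num_frames - 1) // 4   # temporal compression x4 (the first frame is kept)
    h = height // 8                 # spatial compression x8
    w = width // 8                  # spatial compression x8
    return (latent_channels, t, h, w)

# An 81-frame 832x480 clip (about 5 seconds at 16 fps) maps to a 16 x 21 x 60 x 104 latent.
print(latent_shape(81, 480, 832))
```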
How to use Wan2.1
Wan2.1 is open source (including the source code and pretrained model weights) under the Apache 2.0 license, making it freely available for research and commercial use. It can be accessed and downloaded via GitHub and Hugging Face, allowing users to run it either online or locally. For detailed instructions on installation and usage, please refer to the official Wan2.1 documentation and repository.
Access Wan2.1 online. You can try Wan2.1 directly online through various platforms. For example, on Hugging Face Spaces, simply use the provided demo interface.
Enter a text prompt or upload an image, then adjust settings such as resolution (480p or 720p), frame rate, or video duration depending on the platform’s options. Once configured, run the model to generate your video. Generation times vary — for instance, creating a 5-second 480p video on Hugging Face typically takes around 4 minutes. After processing, you can download the generated video or share it directly from the platform.
Download and run Wan2.1 locally. Clone the repository, install the necessary dependencies, download the desired model type (T2V, I2V, FLF2V, or T2I), and then run the application. The models are available via Hugging Face or ModelScope and can be downloaded using either huggingface-cli or modelscope-cli.
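
As a minimal sketch, the weights can also be pulled programmatically with the huggingface_hub Python API instead of the CLI. The repository id shown below follows the project's Hugging Face naming and should be verified against the official model cards before use.

```python
# Sketch of downloading Wan2.1 weights with huggingface_hub.
# The repo_id is an assumption based on the project's Hugging Face naming.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",   # swap for the I2V, FLF2V, or 14B variants
    local_dir="./Wan2.1-T2V-1.3B",
)
print(f"Weights downloaded to {local_dir}")
```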
Before installing dependencies, ensure that your system meets the following minimum requirements:
- GPU (T2V-1.3B model): A single consumer-grade GPU with at least 8.2 GB VRAM (e.g., NVIDIA RTX 3060 or higher).
- GPU (14B models): Multi-GPU setup with high memory capacity (e.g., 4×A100 or similar).
- Operating System: Linux or Windows (Linux preferred for CUDA support).
- Dependencies: Python ≥ 3.8, PyTorch ≥ 2.0, and other packages specified in the official GitHub repository.
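
A quick sanity check along these lines can confirm that a machine meets the list above before installing the heavier dependencies. This is a convenience sketch, not part of the official setup instructions.

```python
# Check Python, PyTorch, CUDA availability, and GPU memory against the minimums above.
import sys
import torch

assert sys.version_info >= (3, 8), "Python >= 3.8 is required"

torch_major, torch_minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
assert (torch_major, torch_minor) >= (2, 0), "PyTorch >= 2.0 is required"

assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU: {torch.cuda.get_device_name(0)}, VRAM: {vram_gb:.1f} GB")
if vram_gb < 8.2:
    print("Warning: below the ~8.2 GB needed even for the T2V-1.3B model.")
```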
Wan2.1 offers several options to improve video quality and provide greater control, such as prompt enhancement, aspect ratio control, inspiration mode, and sound generation. It also supports fine-tuning with techniques such as LoRA and ControlNet, as well as training on custom datasets; training scripts and configuration examples are provided in the official repository.
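
For illustration, a LoRA setup with the peft library might look like the sketch below. The target module names follow common diffusers-style attention projections and are assumptions; the layer names used by the official training scripts may differ.

```python
# Illustrative LoRA configuration for fine-tuning the DiT backbone with peft.
# target_modules names are assumed, not taken from the official scripts.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                        # rank of the low-rank update matrices
    lora_alpha=32,               # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed attention projections
)
# The config would then be attached to the transformer (e.g., with peft's get_peft_model)
# before training on a custom dataset, as described in the repository.
```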
Model design
The diagram below illustrates the core architecture of Wan, which consists of three main components: the Wan-Encoder, the Diffusion Transformer (DiT), and the Wan-Decoder.
Pipeline: Input (text/image/video) → Wan-Encoder (compresses input into latent representation) → Diffusion Transformer (DiT) (performs denoising and generation in latent space) → Wan-Decoder (reconstructs high-quality video frames from latent outputs).
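
The following shape-only walkthrough mirrors that pipeline. The tensors and the denoising stub are placeholders standing in for the Wan-Encoder, T5 encoder, DiT, and Wan-Decoder; they illustrate the dataflow, not the real components or their dimensions.

```python
import torch

# Placeholder tensors: T5-style prompt embeddings and a noisy latent in the
# 4x8x8-compressed latent space (dimensions chosen for illustration only).
text_embeddings = torch.randn(1, 77, 4096)
latents = torch.randn(1, 16, 21, 60, 104)

def dit_denoise(latents, text_embeddings, step):
    # A real DiT would patchify the latents, run transformer blocks with
    # cross-attention to the text embeddings, and predict a less noisy latent.
    return latents

for step in range(50):            # iterative denoising in latent space
    latents = dit_denoise(latents, text_embeddings, step)

# The Wan-Decoder would now map the 16x21x60x104 latent back to RGB video frames.
print(latents.shape)
```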

Wan-Encoder is based on Wan-VAE (Variational Autoencoder), a neural network designed to efficiently compress and reconstruct videos. Wan-VAE compresses videos into a smaller, simpler representation that preserves their spatial and temporal information. This compression makes it possible to handle long, high-resolution videos (such as 1080p) without requiring huge amounts of memory. A T5 text encoder encodes the input prompt (in English, Chinese, or both), and the resulting embeddings are fed into the DiT via cross-attention, conditioning the video generation on the input description.
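
A toy illustration of that cross-attention conditioning: patchified video tokens act as queries, and the text embeddings act as keys and values. The dimensions are placeholders chosen for the example, not Wan2.1's actual sizes.

```python
import torch
from torch import nn

# Placeholder video tokens and prompt embeddings, projected to a common model dim.
video_tokens = torch.randn(1, 1024, 1536)    # (batch, video tokens, model dim)
text_embeddings = torch.randn(1, 77, 1536)   # (batch, text tokens, model dim)

# Cross-attention: queries come from the video, keys/values from the text.
cross_attention = nn.MultiheadAttention(embed_dim=1536, num_heads=12, batch_first=True)
conditioned, _ = cross_attention(video_tokens, text_embeddings, text_embeddings)
print(conditioned.shape)                     # torch.Size([1, 1024, 1536])
```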
DiT (Diffusion Transformer) is the part of the model that actually generates the video frames step-by-step. It starts with random noise and gradually refines it into clear video frames, guided by the input text or images. The DiT architecture is primarily composed of three key elements: a patchifying module for dividing the input into patches, a series of transformer blocks for processing these patches, and an unpatchifying module for reconstructing the output from the processed patches.
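
The sketch below shows what patchifying and unpatchifying amount to in tensor terms: cutting the latent volume into non-overlapping 3D patches, flattening each patch into a token for the transformer, and then inverting the operation. The patch size (1, 2, 2) and latent shape are illustrative assumptions, not the exact values used by Wan2.1.

```python
import torch

latents = torch.randn(1, 16, 21, 60, 104)    # (batch, channels, frames, height, width)
pt, ph, pw = 1, 2, 2                          # assumed temporal and spatial patch sizes

# Patchify: split each axis into (blocks, patch) pairs, then flatten every 3D patch
# into a token, yielding a (batch, num_tokens, token_dim) sequence.
b, c, t, h, w = latents.shape
tokens = (
    latents.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    .permute(0, 2, 4, 6, 1, 3, 5, 7)
    .reshape(b, (t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)
)
print(tokens.shape)                           # torch.Size([1, 32760, 64])

# ... transformer blocks would process `tokens` here ...

# Unpatchify: invert the reshapes to rebuild the latent volume from the token sequence.
restored = (
    tokens.reshape(b, t // pt, h // ph, w // pw, c, pt, ph, pw)
    .permute(0, 4, 1, 5, 2, 6, 3, 7)
    .reshape(b, c, t, h, w)
)
assert torch.equal(restored, latents)         # lossless rearrangement
```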
The Wan-Decoder transforms the latent representations generated by DiT into video frames.
Data: The figure below illustrates how the proportions of training data, grouped by motion characteristics, quality levels, and content categories, change across the different stages of the training process. These proportions are adjusted based on the data throughput achieved at each stage.

Main features of the Wan2.1 models
The performance of the Wan2.1 models is the result of several key innovations, including a new 3D causal spatio-temporal VAE for efficient video compression, multi-resolution pre-training strategies, large-scale data curation comprising billions of images and videos, and automated evaluation metrics. The table below summarizes their key features.
Model | Requirements/Performance |
---|---|
Wan2.1-T2V-14B (Text-to-Video, supports 480P and 720P) | Requires multi-GPU inference and stands as the only video model capable of generating both Chinese and English text, delivering high-quality motion and scene rendering through advanced architectures such as Flow Matching and 3D Causal VAE. |
Wan2.1-T2V-1.3B (Text-to-Video, supports 480P) | Requires approximately 8.19 GB of VRAM, is optimized for consumer-grade GPUs, and delivers performance on par with certain closed-source models. |
Wan2.1-I2V-14B-480P / Wan2.1-I2V-14B-720P (Image-to-Video) | Require multi-GPU inference to convert static images into dynamic videos, generating realistic motion patterns and fluid transitions. |
Wan2.1 FLF2V (First-and-Last-Frame-to-Video, supports 720P) | Requires multi-GPU inference to create videos with seamless transitions between defined start and end frames, ensuring high precision and adaptability while integrating advanced capabilities like LoRA and ControlNet. |
Evaluation
During evaluation, Wan2.1 was compared with some leading open-source and commercial video generative models across a range of benchmarks. The analysis included both quantitative and qualitative metrics, focusing on various aspects such as video realism, temporal coherence, motion fidelity, and alignment with the input prompts.
Quantitative results: The image below shows the evaluation results of eight VAE models, comparing video quality, measured by Peak Signal-to-Noise Ratio (PSNR), against processing efficiency, measured as frames per second (FPS) per unit of latency; latency is the delay between an input and its corresponding output, so lower latency means better responsiveness.
While many models rely on a standardized comparison setup, with a 4×8×8 compression rate and a latent dimension of 16, as seen in Wan-VAE, this study also investigates the impact of alternative settings on performance. Specifically, it evaluates Open Sora Plan with a latent dimension of 4, SVD using a compression rate of 1×8×8, Step Video at 8×16×16, and Mochi at 6×8×8, thereby providing a broader analysis of how different model configurations influence results.
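
For reference, the quality axis uses the standard PSNR definition, sketched below for 8-bit frames with a peak value of 255; the example frames are synthetic and only illustrate the metric.

```python
import numpy as np

def psnr(original: np.ndarray, reconstructed: np.ndarray, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels for 8-bit frames."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                   # identical frames
    return 10.0 * np.log10(peak**2 / mse)

# Example: a reconstruction that is off by one intensity level everywhere scores ~48 dB.
frame = np.random.randint(0, 255, size=(480, 832, 3), dtype=np.uint8)
print(psnr(frame, (frame + 1).astype(np.uint8)))
```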

The findings demonstrate that Wan-VAE achieves a strong balance between video quality and speed. Notably, it is 2.5 times faster than the previous state-of-the-art model (HunyuanVideo) when tested on the same hardware.
Qualitative results: The team tested Wan-VAE’s ability to reconstruct videos in different challenging scenes, including textures, faces, text, and fast motion.

Compared to other VAE models, Wan-VAE performed better in capturing fine details. For example, it accurately showed hair texture and direction, preserved facial features with less blurring around the lips, clearly restored text without distortion, and kept motion sharpness in fast-moving scenes.
Comparison with leading models using Wan-Bench
The team conducted a comprehensive comparison of Wan-14B against leading models, using the Wan-Bench benchmark developed by Alibaba. The detailed results are shown in the table below, where the top-performing model in each category is highlighted in bold.

Wan-14B achieves the highest overall weighted score of 0.724, calculated as a weighted sum of individual dimension scores based on human preference alignments. The model demonstrates competitive or superior performance in generating large motions, maintaining temporal coherence, and preserving fine-grained visual details.
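
To illustrate how such an overall score is assembled, the snippet below computes a Wan-Bench-style weighted sum. The dimension names, per-dimension scores, and weights are made-up placeholders; only the weighted-sum formula mirrors how the reported 0.724 score is described.

```python
# Placeholder dimension scores and human-preference-derived weights (not real Wan-Bench data).
dimension_scores = {"large_motion": 0.80, "temporal_coherence": 0.70, "visual_detail": 0.68}
preference_weights = {"large_motion": 0.40, "temporal_coherence": 0.35, "visual_detail": 0.25}

overall = sum(dimension_scores[d] * preference_weights[d] for d in dimension_scores)
print(round(overall, 3))   # 0.735 with these placeholder numbers
```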
Conclusion
Wan2.1 is an open-source, high-quality generative video system capable of handling multiple input types (text, image, video). It supports a wide range of tasks, including Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, making it a versatile tool for multi-modal content creation.
Using advanced innovations, especially the new Wan-VAE, the model keeps the number of visual tokens low during video encoding and decoding while maintaining rich detail and contextual accuracy. This allows the VAE to encode and decode high-quality 1080P videos of effectively unlimited length while preserving historical temporal context.
Wan2.1-T2V-1.3B is optimized for consumer-grade GPUs, ensuring accessibility for a wider audience, including content creators, researchers, and developers.