TurboDiffusion (code, paper) is an acceleration framework that significantly reduces inference time and computational cost for existing video diffusion models. It can run efficiently on commonly available GPUs and can be used directly on a standard desktop workstation. It was developed by researchers at ShengShu Technology and the TSAIL Lab of Tsinghua University.

In this article, we’ll look at the benefits TurboDiffusion brings, how it speeds up video diffusion, its evaluation results, and how to run it locally.
TurboDiffusion key benefits
- 100–200× speedup: On high-end hardware like the RTX 5090, it can generate a 5-second video in under 2 seconds (tasks that previously took ~3 minutes).
- High quality: It maintains visual fidelity by using rCM (score-regularized continuous-time consistency model) for step distillation, which allows the model to reach the final output in just 3–4 steps instead of the usual 50–100.
- Consumer GPU optimization: Unlike many similar models that require industrial A100/H100 clusters, TurboDiffusion is optimized for consumer-grade GPUs like the RTX 4090 and 5090. It applies W8A8 quantization, which stores weights and activations at 8-bit precision, cutting VRAM usage and increasing speed.
- Open-source: The GitHub repository contains the model checkpoints, training and inference code, along with pretrained weights for widely used models such as Wan2.1 and Wan2.2.
Why video diffusion is slow
Video diffusion models can produce high-quality videos, but they are computationally expensive and often slow. Even short clips may take minutes or hours to generate. This is because diffusion models work by reversing a noise process: real data, such as images, is gradually corrupted with Gaussian noise, and the model learns to remove this noise step by step. This process typically requires dozens or even hundreds of iterations before a clean image or video is produced.

This iterative refinement is computationally expensive even for a single image. For video generation, the cost increases significantly because the model must generate many frames while maintaining temporal consistency across them. A typical smooth video runs at 24–30 fps (frames per second), while higher-quality content may target 60 fps or more. For example, a 5-second clip at 30 fps contains 150 frames, all of which must be generated and coordinated during the denoising process.
Maintaining smooth motion adds another layer of complexity. Modern video diffusion models use spatiotemporal attention to capture relationships across both space and time, ensuring that objects remain stable, identities persist, and motion is coherent. As a result, both computation and memory requirements grow rapidly with video length and frame rate, meaning that even modest increases in duration can significantly raise processing costs.
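To make the cost concrete, here is a minimal, purely illustrative NumPy sketch of the iterative denoising loop described above (a toy stand-in, not an actual diffusion model): every denoising step is one full network pass over all frames, so total compute scales with steps × frames.

```python
import numpy as np

def toy_denoise_step(latents: np.ndarray, t: float) -> np.ndarray:
    """Stand-in for one denoiser forward pass. A real video diffusion model
    would run a large spatiotemporal transformer over all frames here."""
    return latents * (1.0 - 0.5 * t)

def generate_clip(num_steps: int, num_frames: int, h: int = 8, w: int = 8):
    rng = np.random.default_rng(0)
    latents = rng.standard_normal((num_frames, h, w))  # one latent per frame
    forward_passes = 0
    for i in reversed(range(num_steps)):
        t = (i + 1) / num_steps              # noise level for this step
        latents = toy_denoise_step(latents, t)
        forward_passes += 1                  # each step processes every frame jointly
    return latents, forward_passes

# 5 seconds at 30 fps = 150 frames; a 50-step schedule means 50 full passes,
# each one covering all 150 frames at once
_, passes = generate_clip(num_steps=50, num_frames=150)
print(passes)  # 50
```

Cutting either factor (steps per video, or cost per step) reduces total time, which is exactly where TurboDiffusion's techniques apply.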
How TurboDiffusion achieves high speedups
TurboDiffusion addresses this challenge by integrating three main techniques:
- Attention acceleration: Reduces the high computational cost of attention operations, which are one of the main bottlenecks in video diffusion models. It uses two mechanisms: SageAttention and sparse-linear attention. SageAttention is a low-bit (quantized) attention mechanism that lowers numerical precision with minimal quality loss. Sparse-linear attention (SLA) teaches the model to focus only on the most important parts of the input.
- Step distillation (e.g., rCM): TurboDiffusion reduces the generation process to just a few steps, sometimes as few as 2–4, while traditional diffusion pipelines typically require 50–100 denoising steps. This is achieved by training a fast student model to match the final output of a slower teacher model. Instead of removing noise step by step, the student learns to move directly from noisy input to a clean result in large steps.
- W8A8 quantization: It applies 8-bit quantization to model weights and activations in linear layers to reduce memory usage and accelerate inference.
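The W8A8 idea can be illustrated with a small NumPy sketch. This is a toy per-tensor symmetric scheme, not TurboDiffusion's actual kernels: weights and activations are each rounded to int8, the matrix multiply runs on integers (accumulating in int32), and the two scales are applied afterward.

```python
import numpy as np

def quantize_sym(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x ≈ scale * q."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)   # linear-layer weight
a = rng.standard_normal((4, 64)).astype(np.float32)    # input activations

qw, sw = quantize_sym(w)   # W8: 8-bit weights
qa, sa = quantize_sym(a)   # A8: 8-bit activations

# Integer matmul accumulated in int32, then dequantized with the two scales
y_w8a8 = (qa.astype(np.int32) @ qw.T.astype(np.int32)) * (sa * sw)
y_fp32 = a @ w.T

rel_err = np.abs(y_w8a8 - y_fp32).mean() / np.abs(y_fp32).mean()
print(f"{rel_err:.4f}")  # small relative error vs. full precision
```

The int8 tensors take a quarter of the memory of float32, and integer matrix multiplies map onto fast low-precision GPU hardware, which is where the speedup comes from.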
Together, these enable real-time video generation on a single consumer GPU (such as an RTX 5090), a notable leap compared to existing pipelines.
The core idea of TurboDiffusion: Generate videos using only a few diffusion steps instead of dozens or hundreds, through more efficient computation, while maintaining visual quality comparable to baseline models. This enables new possibilities for AI video editing, real-time content creation, interactive storytelling, and scalable video generation services.
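The few-step idea can be caricatured with a toy trajectory in NumPy. This is only a conceptual analogy, not the rCM training procedure: a many-step sampler makes many small corrections, while a distilled few-step sampler covers roughly the same trajectory in a handful of large jumps.

```python
import numpy as np

def sample(x: np.ndarray, num_steps: int) -> np.ndarray:
    """Toy sampler: each step removes a 1/num_steps fraction of the remaining
    'noise'. Fewer steps means larger jumps along the same trajectory."""
    for _ in range(num_steps):
        x = x * (1.0 - 1.0 / num_steps)
    return x

x0 = np.full((2, 2), 1.0)            # toy 'noisy' starting point
teacher = sample(x0, num_steps=50)   # many small steps, like a standard sampler
student = sample(x0, num_steps=4)    # few large steps, like a distilled model

# Both land near the same endpoint, but the 4-step run needs
# 12.5x fewer 'network' evaluations
print(float(teacher[0, 0]), float(student[0, 0]))
```

In the real system, the student network is trained to match the teacher's endpoint, so those large jumps stay accurate instead of merely approximate as in this toy.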
Evaluation results
Researchers conducted extensive tests using the Wan model series to evaluate the efficiency of TurboDiffusion. The experiments covered both Text-to-Video (T2V) and Image-to-Video (I2V) workflows across a variety of resolutions and model sizes.
By benchmarking TurboDiffusion against the original baseline models, the team demonstrated a significant reduction in processing time. The table below summarizes these results as measured on a single NVIDIA RTX 5090 GPU:
| Model variant | Baseline time | TurboDiffusion time | Speedup factor |
|---|---|---|---|
| Wan2.1 1.3B (480p) | ~184 seconds | 1.9 seconds | ~96.8x |
| Wan2.1 14B (480p) | ~1,635 seconds | 9.4 seconds | ~174x |
| Wan2.1 14B (720p) | ~4,767 seconds | 24 seconds | ~198.6x |
| Wan2.2 I2V 14B (720p) | ~4,549 seconds | 38 seconds | ~120x |
Importantly, these speed gains come with minimal loss of visual quality, meaning users can generate usable, high-fidelity video much faster than before.
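The speedup column follows directly from the two timing columns; here is a quick check, with values copied from the table above:

```python
# Baseline vs. TurboDiffusion generation times (seconds), from the table above
results = {
    "Wan2.1 1.3B (480p)":    (184.0, 1.9),
    "Wan2.1 14B (480p)":     (1635.0, 9.4),
    "Wan2.1 14B (720p)":     (4767.0, 24.0),
    "Wan2.2 I2V 14B (720p)": (4549.0, 38.0),
}
for name, (baseline_s, turbo_s) in results.items():
    print(f"{name}: {baseline_s / turbo_s:.1f}x speedup")
```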

Setting up TurboDiffusion on your local GPU
To work with TurboDiffusion, you need the following prerequisites:
- A CUDA-capable NVIDIA GPU: A GPU with sufficient VRAM is recommended, especially for larger video models or higher resolutions. Performance and memory requirements depend on the base model and configuration used.
- Python 3.10 or newer: This version is commonly used across modern deep learning projects.
- CUDA toolkit: Ensure that your CUDA version is compatible with both your NVIDIA driver and your PyTorch installation.
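Before installing, it can help to confirm that PyTorch sees your GPU and which CUDA version it was built against. This uses only standard PyTorch calls; run it in the Python environment you plan to use:

```python
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.version.cuda)          # CUDA version PyTorch was built against
print(torch.cuda.is_available())   # True if a usable NVIDIA GPU + driver is present
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(torch.cuda.get_device_name(0))
    print(f"{props.total_memory / 1e9:.1f} GB VRAM")
```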
To get started, please refer to the TurboDiffusion repository. It has all the necessary code, documentation, and instructions. Here’s an overview of the steps:
- Clone the repository.
- Install dependencies: Use standard package managers (e.g., pip) to install Python packages and required ML libraries.
- Download model checkpoints: TurboDiffusion relies on pretrained model checkpoints provided by the authors. Pre-optimized checkpoints are provided for supported video diffusion models (e.g., TurboWan2.2 variants).
- Run inference: Users can generate videos from text or image prompts using the provided scripts. TurboDiffusion usually requires only a few inference steps (e.g., 2–4), which is a key factor in its speedup.
Conclusion
TurboDiffusion is an efficient framework that accelerates video diffusion models, delivering 100–200× faster generation while maintaining high visual quality. It directly addresses two major limitations of diffusion‑based video generation: high computational cost and slow production speed.
Because the framework is fully open‑source and designed to run on commonly available GPUs, including a single RTX 5090, it enables users ranging from independent creators to enterprise‑scale developers to integrate high‑performance, near‑real‑time video generation into their workflows.
