Create long AI videos locally with FramePack from Stanford

FramePack is a next-frame prediction neural network for high-quality and efficient video generation, optimized to run on consumer PCs with as little as 6 GB of VRAM.

FramePack introduces a more accessible and efficient method for long video generation. Instead of feeding the model the full history of frames, it compresses the input frames into a fixed-size representation whose length is independent of the video's duration.

This innovation brings the computational cost down to the level of standard image diffusion. It enables the generation of thousands of frames at 30 fps with a 13B-parameter model on a modest 6 GB laptop GPU.

Existing video diffusion models show significantly improved visual quality after fine-tuning with FramePack. The authors applied it to two models, Wan2.1 and an upgraded version of HunyuanVideo, both of which produced markedly better videos as a result.

The project was developed by Lvmin Zhang (the mind behind ControlNet and Fooocus) and Maneesh Agrawala from Stanford University.

Source: compilation of video clips from the project page

How to use it

FramePack is an open-source project licensed under Apache 2.0. Its official implementation can be found on the project’s GitHub repository, along with detailed setup instructions. Users can choose between two available options:

Option 1: manual setup (Linux or Windows). Clone the repository, create a virtual environment, install dependencies, download pretrained models, upload an image and start generation.

Option 2: one-click installer for Windows. Download the zip archive, extract it, run update.bat to get the latest version, then run.bat to launch the UI, upload an image and start generation.

Key points

  • Generates long videos on 6 GB GPUs. FramePack keeps memory usage low by compressing its inputs into a fixed-size context, which lets it generate thousands of video frames at 30 FPS with a 13B parameter model, even on GPUs with just 6 GB of VRAM.
  • Can be trained with batch sizes of up to 64, similar to image diffusion models and far larger than is typical for video models. Training fits on a single compute node with up to 8 A100 or H100 GPUs, making it suitable for both research and personal projects.
  • Achieves high frame generation speed. With default settings, an RTX 4090 produces roughly one frame every 2.5 seconds; with the TeaCache optimization, this drops to around 1.5 seconds per frame (see the worked timing example after this list).
  • No timestep distillation. To preserve visual fidelity, FramePack runs full denoising steps for each frame rather than a distilled sampler.
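
To put these speeds in perspective, here is a back-of-the-envelope calculation based on the per-frame timings quoted above (illustrative arithmetic only, not a measured benchmark):

```python
# Back-of-the-envelope generation-time estimate using the figures quoted
# above for an RTX 4090 (~2.5 s/frame by default, ~1.5 s/frame with
# TeaCache). Illustrative arithmetic only, not a benchmark.

OUTPUT_FPS = 30  # playback rate of the generated video
SECONDS_PER_FRAME = {"default": 2.5, "teacache": 1.5}

def generation_time(clip_seconds: float, mode: str = "teacache") -> float:
    """Estimated wall-clock seconds needed to generate a clip of the given length."""
    frames = clip_seconds * OUTPUT_FPS
    return frames * SECONDS_PER_FRAME[mode]

for mode in ("default", "teacache"):
    minutes = generation_time(clip_seconds=5, mode=mode) / 60
    print(f"5 s clip ({mode}): {5 * OUTPUT_FPS} frames, about {minutes:.1f} minutes")
```

A five-second clip at 30 FPS is 150 frames, so it takes roughly six minutes with default settings and under four minutes with TeaCache on that card.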

Challenges in traditional video diffusion

Traditional video diffusion models predict each frame based on previously generated noisy frames. As videos become longer, the temporal context grows, leading to increased memory usage and heavier computational demands.

These models also suffer from forgetting, where earlier content fades from memory and causes inconsistencies, and from drifting, where accumulated prediction errors progressively degrade visual quality over time.
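
To make the scaling problem concrete, the short sketch below counts context tokens for a vanilla next-frame setup; the per-frame token count is a made-up example value, not a figure from any specific model:

```python
# Illustrative only: how the temporal context grows in a vanilla
# next-frame video diffusion setup. The token count per latent frame is a
# made-up example value, not a figure from FramePack or its paper.

TOKENS_PER_FRAME = 1536  # hypothetical patch tokens for one latent frame

def vanilla_context_tokens(num_past_frames: int) -> int:
    """Every past frame is kept in full, so the context grows linearly."""
    return num_past_frames * TOKENS_PER_FRAME

for n in (30, 300, 3000):  # 1 s, 10 s and 100 s of history at 30 fps
    print(f"{n:5d} past frames -> {vanilla_context_tokens(n):,} context tokens")
```

Because attention cost grows rapidly with context length, long videos quickly become impractical without some form of compression.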

How FramePack solves these challenges

Reducing temporal context: FramePack reduces the input context to a fixed length, ensuring that the computational workload remains stable regardless of the video’s duration.

Anti-forgetting: It compresses the history of input frames into a compact, fixed-size summary that preserves the essential visual information, so the model no longer needs to track every frame individually.

Anti-drifting: It uses a novel anti-drifting sampling method that generates frames in inverted temporal order, starting from the endpoints, so each new frame is anchored to already-established content and prediction errors are less able to accumulate.
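
The snippet below is a simplified sketch of this idea, not the exact scheduler from the paper: video sections are generated in inverted temporal order, so each newly generated section is conditioned on the user's input image and on sections that are already fixed, rather than on a growing chain of the model's own earlier predictions.

```python
# Simplified sketch of anti-drifting (inverted-order) sampling. This is an
# illustration of the idea, not the exact scheduler used by FramePack.

def inverted_generation_order(num_sections: int) -> list[int]:
    """Generate the last section of the clip first and work backwards."""
    return list(range(num_sections - 1, -1, -1))

def generate_clip(num_sections: int) -> list[str]:
    anchored: dict[int, str] = {}  # sections whose frames are already finalized
    for section in inverted_generation_order(num_sections):
        # Each new section is conditioned on the user's input image (the very
        # first frame) and on the later, already-generated sections, which
        # limits how prediction errors can snowball.
        later = sorted(s for s in anchored if s > section)
        anchored[section] = f"section {section} conditioned on input image + sections {later}"
    return [anchored[s] for s in range(num_sections)]

for line in generate_clip(4):
    print(line)
```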

In simple terms, FramePack is like a very efficient video summarizer. It converts all the previous frames into a compact, fixed-length summary, focusing on what just happened but also keeping a general idea of what happened earlier. This summarization allows the model to handle longer videos without being overwhelmed by the expanding temporal context.

How FramePack compresses and prioritizes frames

To compress the input frames, the model divides each frame into smaller squares, or patches, and extracts the most relevant information from them. Recent frames are prioritized for predicting the next frame, as they provide crucial context, while earlier frames are represented with less detail. This strategy allows the model to generate the next frame without processing every past frame in full detail.
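
As an illustration of this prioritization, the sketch below applies a simple geometric schedule (the first of the ablation variants discussed next), where the most recent frame keeps full detail and each older frame gets half the token budget of the frame after it; the token numbers are invented for the example.

```python
# Illustrative sketch of geometric context compression: the most recent
# frame keeps full detail and each older frame receives half the token
# budget of the frame after it. The token count is invented for the
# example and is not FramePack's actual patchify configuration.

FULL_FRAME_TOKENS = 1536  # hypothetical tokens for an uncompressed latent frame

def framepack_context_tokens(num_past_frames: int) -> int:
    """Total context tokens when frame i (0 = most recent) gets FULL // 2**i tokens."""
    # The per-frame budgets form a geometric series, so the total stays below
    # 2 * FULL_FRAME_TOKENS no matter how long the video gets; very old frames
    # eventually round down to zero tokens, i.e. they are dropped or folded
    # into a coarse summary.
    return sum(FULL_FRAME_TOKENS // (2 ** i) for i in range(num_past_frames))

for n in (30, 300, 3000):  # same history lengths as the earlier sketch
    print(f"{n:5d} past frames -> {framepack_context_tokens(n):,} context tokens")
```

Compared with the linearly growing context in the earlier sketch, the total here stays roughly constant at about twice the cost of a single uncompressed frame, which is why the workload no longer depends on video length.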

The research team evaluated various scaling factors and compression rates for input video frames to examine their effects on performance, efficiency, and output quality. The five different ablation variants (a-e) are illustrated in the figure below.

FramePack ablation variants (source: paper)
  1. Geometric progression: This setup uses a standard geometric scaling of context, halving memory usage at each level. Each step compresses the frames more and more (1, 1/2, 1/4, 1/8…), resulting in efficient memory handling and consistent performance with increasing video length.
  2. Progression with duplicated levels: Instead of continually scaling down, this variant holds some compression levels constant for several layers. This allows the model to spend more computation on specific resolution levels, for better detail retention or feature stability.
  3. Geometric progression with temporal kernel: This structure combines geometric compression with temporal kernels, meaning multiple frames are processed together as a group (tensor). This variant improves temporal consistency.
  4. Progression with important start: The very first frame (F0) gets the full context length, preserving all its detail. This might be useful if the beginning of the video is especially important, when early frames guide the overall scene structure or motion trajectory.
  5. Symmetric progression: A balanced version where all starting frames are treated equally, without giving priority to specific frames. This can be beneficial for tasks requiring consistent context application across the timeline.

The goal of these designs is to optimize GPU memory usage, allowing for efficient video creation on hardware with limited resources.
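
As a rough way to compare these designs, the sketch below expresses four of the variants as per-frame compression schedules and sums the context budget each one spends on a short history window; the schedules and token counts are simplified stand-ins for illustration, not the exact kernel configurations from the paper (variant 3, which groups frames with a temporal kernel, is omitted here).

```python
# Rough, illustrative comparison of the ablation variants expressed as
# per-frame compression schedules. Schedules and token counts are
# simplified stand-ins, not the exact kernel configurations in the paper.

FULL = 1536     # hypothetical tokens for an uncompressed frame
HISTORY = 16    # past frames considered in this toy example (most recent first)

def budget(schedule: list[float]) -> int:
    """Total context tokens produced by a per-frame compression schedule."""
    return sum(int(FULL * rate) for rate in schedule)

variants = {
    # (1) plain geometric halving: 1, 1/2, 1/4, ...
    "geometric": [1 / 2 ** i for i in range(HISTORY)],
    # (2) hold each compression level for two consecutive frames
    "duplicated levels": [1 / 2 ** (i // 2) for i in range(HISTORY)],
    # (4) geometric, but the oldest frame in the window (F0) keeps full detail
    "important start": [1 / 2 ** i for i in range(HISTORY - 1)] + [1.0],
    # (5) symmetric: every frame in the window gets the same mid-level rate
    "symmetric": [1 / 4] * HISTORY,
}

for name, schedule in variants.items():
    print(f"{name:17s} -> {budget(schedule):6,d} context tokens for {HISTORY} frames")
```

The cheaper a variant's total budget, the less memory and compute it needs per generated frame, but the less detail it retains from older frames, which is the efficiency-versus-detail trade-off the ablations examine.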

Main features

  • Model type: Next-frame prediction neural network for video synthesis
  • Model size: 13B parameters
  • Frame generation rate: Up to 30 FPS (frames per second)
  • Hardware requirements: Runs on consumer GPUs with 6 GB VRAM (e.g., RTX 3060 laptops)
  • Training batch size: Up to 64 (similar to image diffusion models)
  • Training hardware: Single compute node with 8 × A100 or H100 GPUs (for fine-tuning)
  • Sampling speed (RTX 4090): ~2.5 seconds per frame (standard), ~1.5 seconds per frame with the TeaCache optimization
  • Context compression: Compresses input into a fixed-length context regardless of video length
  • Diffusion approach: Full denoising steps per frame (no timestep distillation, for higher visual fidelity)
  • Output type: Long-form, high-quality video generated progressively (frame by frame)

Conclusion

FramePack is a video diffusion model built to generate long, high-quality AI-driven videos, even on low-resource devices like laptops with limited GPU memory. By optimizing performance for hardware with as little as 6 GB of VRAM, it makes advanced video generation more accessible to a broader audience. Beyond professional use, FramePack can also be employed for creative projects like GIFs, memes, and other dynamic content.
