TokenFlow: make high-quality video edits from text prompts using diffusion features

TokenFlow is a framework for text-based video editing that leverages a pre-trained text-to-image diffusion model, without requiring any training or fine-tuning. It uses text prompts to specify the desired edit and generate high-quality videos, while preserving the spatial layout and motion of the input video.

TokenFlow was proposed by a research team from the Weizmann Institute of Science in Israel.

The picture below illustrates how TokenFlow edits videos. It takes an input video (top row) and modifies it based on a target text prompt (middle and bottom rows), while keeping the semantic structure and motion in the original scene.

TokenFlow showcases realistic semantic changes of real-world videos. Image source

TokenFlow can edit a video based on a text prompt. For example, you can write “make the sky blue” and TokenFlow will change the sky color in the video. Rather than simply editing the video frame by frame, it keeps the result smooth and natural, as if the video had never been edited.

What is the process behind TokenFlow’s video editing? The core concept is leveraging information from the original video to edit the new one, ensuring a consistent flow of information throughout the entire video. This approach results in a realistic and coherent edited video.

TokenFlow does not edit each frame in isolation. Instead, it reuses features from one frame when editing another, based on how the frames correspond in the original video. This keeps the diffusion features of the edited video consistent across frames, just as they are in the original video.


TokenFlow operates on the diffusion features (tokens) of the video, guided by a text description of the desired edit. For example, given a video of a dog running in a park and the prompt “a cat running in a park”, TokenFlow can generate a video of a cat running in the same park, while preserving the spatial layout and motion of the original video.

The figure below shows its main steps.

Here is a simple explanation of each step:

Top: video inversion

The top part shows how TokenFlow prepares the input video for editing: it inverts each frame of the input video using a diffusion model and extracts its tokens (i.e., the output features from the self-attention modules).

  • DDIM inversion. For a given input video I, use a diffusion model to convert each frame into a noisy latent and extract its tokens (i.e., the output features from the self-attention modules). DDIM stands for Denoising Diffusion Implicit Models.
  • NN search. Find the correspondences between the tokens of different frames (inter-frame feature correspondences) by performing a nearest-neighbor (NN) search between the original video features.
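
The NN search step can be sketched as follows. This is a minimal illustration, not the authors' implementation: tokens are plain Python lists of floats standing in for self-attention features, cosine similarity is an assumed choice of similarity measure, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def nearest_neighbor_field(frame_tokens, keyframe_tokens):
    """For each token of a frame, return the index of the most
    similar token in a keyframe (hypothetical helper)."""
    field = []
    for token in frame_tokens:
        sims = [cosine_similarity(token, key) for key in keyframe_tokens]
        field.append(max(range(len(sims)), key=sims.__getitem__))
    return field

# Toy 2-D features taken from the ORIGINAL (unedited) video.
keyframe_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_tokens = [[0.1, 0.9], [0.9, 0.1], [0.6, 0.5]]

nn_field = nearest_neighbor_field(frame_tokens, keyframe_tokens)
print(nn_field)  # one keyframe token index per frame token
```

The resulting index list is the kind of correspondence that is computed once on the original video and reused later when the edited tokens are propagated.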

Bottom: denoising step

The bottom part shows how TokenFlow edits and generates the output video by joint editing and TokenFlow propagation.

  • (I) Joint editing. Select keyframes from the noisy video (J_t) and jointly edit them with the diffusion model, conditioned on the text prompt, using an extended-attention block. This step generates a set of edited tokens (T_base) that match the target text prompt.
  • (II) TokenFlow propagation. The final step is to propagate these edited tokens across the entire video, following the pre-computed correspondences of the original video features. This step ensures temporal consistency in the diffusion feature space.
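
The two steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: extended attention is modeled as each keyframe attending to the concatenated keys and values of all keyframes, and propagation is modeled as a linear blend of the matched tokens from the two surrounding keyframes. The diffusion model and text conditioning are left out, and all function names and the blending weight are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def extended_attention(queries, keyframe_keys, keyframe_values):
    """Attend from one keyframe's queries to the keys and values of
    ALL keyframes (assumed form of the extended-attention block)."""
    keys = [k for frame in keyframe_keys for k in frame]
    values = [v for frame in keyframe_values for v in frame]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def propagate_tokens(prev_tokens, next_tokens, nn_prev, nn_next, weight):
    """Build the tokens of an in-between frame from edited keyframe tokens.
    nn_prev[p] / nn_next[p] are the precomputed matches of token p in the
    previous / next keyframe; weight in [0, 1] is the share of the next one."""
    return [
        [(1 - weight) * a + weight * b
         for a, b in zip(prev_tokens[i_prev], next_tokens[i_next])]
        for i_prev, i_next in zip(nn_prev, nn_next)
    ]

# Toy example: two keyframes with one token each, attended jointly.
attended = extended_attention(
    queries=[[1.0, 0.0]],
    keyframe_keys=[[[1.0, 0.0]], [[0.0, 1.0]]],
    keyframe_values=[[[1.0, 0.0]], [[0.0, 1.0]]],
)

# Toy example: a frame closer to the previous keyframe (weight 0.25).
edited_prev = [[0.0, 0.0], [2.0, 2.0]]
edited_next = [[4.0, 4.0], [6.0, 6.0]]
propagated = propagate_tokens(edited_prev, edited_next,
                              nn_prev=[1, 0], nn_next=[0, 1], weight=0.25)
print(attended, propagated)
```

Because the correspondences come from the original video, the blended tokens follow the original motion even though their content comes from the edited keyframes.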

In summary, the video editing algorithm converts the input video into noisy latents using DDIM inversion. Then, it denoises the video through an iterative process of denoising keyframes and applying TokenFlow propagation with extended-attention at every self-attention block in every layer of the network.
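
The inversion half of this summary can be illustrated with the deterministic DDIM update. The sketch below is a toy under stated assumptions: the noise schedule is made up, latents are short Python lists, and a constant function stands in for the diffusion model's noise prediction (with a real network, inversion is only approximate). In TokenFlow the denoising pass would additionally use the target prompt, extended attention, and token propagation, which are omitted here.

```python
import math

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels.
    alpha_* are cumulative signal coefficients in (0, 1]."""
    x0 = [(xi - math.sqrt(1 - alpha_from) * ei) / math.sqrt(alpha_from)
          for xi, ei in zip(x, eps)]
    return [math.sqrt(alpha_to) * x0i + math.sqrt(1 - alpha_to) * ei
            for x0i, ei in zip(x0, eps)]

def toy_noise_predictor(x, t):
    # Hypothetical stand-in for the diffusion network: constant noise estimate.
    return [0.5 for _ in x]

def ddim_invert(latent, alphas, predict_noise):
    """Walk a clean latent toward noise (decreasing alpha)."""
    x = list(latent)
    for t in range(len(alphas) - 1):
        x = ddim_step(x, predict_noise(x, t), alphas[t], alphas[t + 1])
    return x

def ddim_denoise(noisy, alphas, predict_noise):
    """Walk a noisy latent back toward a clean one (increasing alpha)."""
    x = list(noisy)
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, predict_noise(x, t), alphas[t], alphas[t - 1])
    return x

alphas = [1.0, 0.9, 0.7, 0.4]   # hypothetical schedule; smaller means noisier
latent = [0.2, -1.0, 0.7]       # stands in for one frame's latent

noisy = ddim_invert(latent, alphas, toy_noise_predictor)
recovered = ddim_denoise(noisy, alphas, toy_noise_predictor)
print(noisy, recovered)
```

With the constant toy predictor, denoising the inverted latent returns the original latent up to floating-point error, which is the property that lets an inverted video serve as the starting point for editing.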

Below we can see example results of TokenFlow using different text prompts such as: “shiny silver robot”, “Van Gogh style”, and “Star Wars clone trooper”.


The method was evaluated on DAVIS and Internet videos using various text prompts.

Qualitative evaluation. The method was compared with four other methods: Tune-A-Video, PnP-Diffusion (per frame), Gen-1, and Text2Video-Zero. We can see the results in the figure below (with TokenFlow in the last row):

TokenFlow (bottom row) compared with other similar methods. Image source
  • Tune-A-Video (second row) fine-tunes the model on the given video, but it only works well for short clips. On longer videos it produces edits that do not match the input, as in the shiny metal sculpture example.
  • Applying PnP to each frame separately (third row) produces good edits that match the text, but they lack any temporal consistency.
  • The results of Gen-1 (fourth row) are also inconsistent from frame to frame (the beak of the origami stork changes color), and its frames are less sharp than those of a text-to-image diffusion model.
  • Text2Video-Zero (fifth row) produces edits with severe jittering (frame-to-frame variation or distortion in the edited video).
  • TokenFlow (bottom row) produces videos that better match the text prompt while remaining smooth and natural, whereas the other methods struggle to achieve both at once.

Quantitative evaluation. The paper measures how well the edited videos match the text and how smooth they look over time. TokenFlow achieves the best scores on both measures compared with the baselines.
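
To make the two kinds of measurement concrete, here is a rough sketch of simple proxies for them. It assumes that each edited frame and the prompt have already been embedded by some joint image-text encoder; the embeddings, function names, and both scores are illustrative assumptions, and the paper's exact metrics may differ.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def text_alignment_score(frame_embeddings, text_embedding):
    """Mean similarity between each frame and the prompt (assumed proxy)."""
    sims = [cosine_similarity(f, text_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)

def temporal_consistency_score(frame_embeddings):
    """Mean similarity between consecutive frames (assumed proxy)."""
    pairs = list(zip(frame_embeddings, frame_embeddings[1:]))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical precomputed embeddings for a two-frame video and a prompt.
frame_embeddings = [[1.0, 0.0], [0.0, 1.0]]
text_embedding = [1.0, 0.0]

alignment = text_alignment_score(frame_embeddings, text_embedding)
consistency = temporal_consistency_score(frame_embeddings)
print(alignment, consistency)
```

Higher values of both scores would indicate an edit that follows the prompt and changes little between neighboring frames.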

Ablation study. This is a procedure where certain parts of a system are removed or replaced to study the effect on the system’s performance or behavior. Ablation studies are often used in machine learning to understand how different components of a model contribute to its results.

The ablation study results showed that the full TokenFlow method outperformed all the ablated variants on all three metrics, indicating that both token-level and flow-level attention are important for producing accurate and temporally consistent edited videos.

Limitations. According to the authors, TokenFlow has some drawbacks. It cannot handle edits that change the structure of the video and may produce visual artifacts when the image editing technique fails.

The latent diffusion model (LDM) decoder introduces some high-frequency flickering. A possible solution would be to combine TokenFlow with an improved decoder.

Limitations: TokenFlow cannot handle edits that require structural deviations.

Conclusion, future research

TokenFlow is a framework that allows users to edit videos by describing the desired result in a text prompt. It uses a pre-trained text-to-image diffusion model while preserving the motion and structure of the original video.

Future research could investigate other types of diffusion models, or devise ways to deal with edits that require structural changes, such as changing the position or perspective of the objects in the video.
