StoryDiffusion creates coherent comics and videos from text

StoryDiffusion is a new model for generating long-range stories as a coherent series of images or videos. It keeps characters, styles, and attire consistent, offering a unified storytelling experience across formats such as comics and videos.

The model is open-source under the MIT license and available for use both online and on your PC. You can create stories directly on its official website or download the code from its GitHub repository.

Consistent images generated by StoryDiffusion. Left: a comic about a man finding treasure in the jungle. Right: a comic depicting LeCun’s moon expedition, guided by an image (source: paper)

StoryDiffusion uses Consistent Self-Attention for uniform character design across images and a Semantic Motion Predictor for smooth video transitions and stable subjects.

The evolution of generative modeling tools

Among the generative modeling tools, diffusion models have quickly shown impressive results in creating diverse types of content, such as images, 3D objects, and videos. They work by gradually transforming random noise into a structured output, guided by a learned probability distribution.

Text-to-image generation has advanced significantly with models like Latent Diffusion, DiT, and Stable Diffusion XL. Methods such as ControlNet, T2I-Adapter, MaskDiffusion, and StructureDiffusion have been introduced to improve control over image generation.

Several approaches aim to generate images that preserve the identity of a given reference image (ID preservation):

  1. fine-tune part of the model using a specific image, like in Textual Inversion, DreamBooth, and Custom Diffusion.
  2. use pre-trained models like IP-Adapter and PhotoMaker for direct image control.
  3. use Consistent Self-Attention, which is training-free, to ensure subject consistency across multiple images, like in StoryDiffusion.

Text-based video generation is also becoming popular due to its ability to quickly produce custom content, without the need for traditional video production resources. StoryDiffusion outperforms methods like SEINE and SparseCtrl by more accurately predicting and maintaining the semantic coherence of motion throughout the generated videos.

The model

StoryDiffusion uses the concept of self-attention, specifically adapted for visual generation. The method has 2 stages:

(I) Consistent Self-Attention: this module establishes connections between images within a group, ensuring the characters depicted maintain uniformity in their appearance and clothing. It requires no prior training and can be seamlessly integrated into existing systems.

(II) Semantic Motion Predictor: used for video generation, this module transforms a series of generated images into a cohesive video, ensuring smooth transitions and maintaining subject consistency throughout the video sequence.

Pipeline for creating visually coherent stories

Foundation: a pretrained text-to-image diffusion model that integrates the Consistent Self-Attention mechanism, as illustrated in the next figure.

Input: a story text.

Output: a sequence of images that visually narrate the input story.

Text-to-image generation pipeline (source: paper)

It has 2 main steps:

  1. Split the story and draw the images (a): decompose the input story text into a sequence of prompts, with each prompt corresponding to a distinct image. Create a batch of images simultaneously, using the segmented prompts.
  2. Make connections (d): use Consistent Self-Attention to build connections between the batched images, creating a consistent visual story.
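
The first step can be sketched in a few lines of Python. This is only an illustration: the article does not say how the story is decomposed, so splitting on sentence boundaries is an assumption, and `split_story`, the character description, and the style suffix are hypothetical names and values chosen for the example.

```python
import re
from typing import List


def split_story(story: str, character: str, style: str = "comic book style") -> List[str]:
    """Turn a story into one prompt per sentence (assumed splitting rule).

    The same character description is repeated in every prompt so that each
    image is asked to depict the same subject.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", story) if s.strip()]
    return [f"{character}, {s.rstrip('.!?')}, {style}" for s in sentences]


story = "A man enters the jungle. He finds an old map. He digs up a treasure chest."
prompts = split_story(story, character="a man in a brown jacket")
for prompt in prompts:
    print(prompt)
```

Each resulting prompt would then drive one image in the batch, which is what makes the second step (connecting the images) possible.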

For text-to-image generation, StoryDiffusion used pre-trained models such as Stable Diffusion XL and Stable Diffusion 1.5, augmented with Consistent Self-Attention.
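
To make the idea of "connections between the batched images" more concrete, here is a minimal NumPy sketch of one way such an attention layer could work. It assumes that each image keeps its own queries, while its keys and values are built from its own tokens plus tokens randomly sampled from the other images in the batch. The function name, the `sample_rate` parameter, and the random projection matrices are illustrative assumptions, not the project's actual code.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def consistent_self_attention(tokens, w_q, w_k, w_v, sample_rate=0.5, rng=None):
    """tokens: (B, N, C) array holding N tokens of width C for each of B images.

    Each image attends to its own tokens plus tokens sampled from the other
    images in the batch, which lets appearance information flow across images.
    """
    rng = np.random.default_rng() if rng is None else rng
    B, N, C = tokens.shape
    if B < 2:
        raise ValueError("need at least two images in the batch")
    n_sampled = int(sample_rate * N)
    out = np.empty((B, N, w_v.shape[1]))
    for i in range(B):
        # Pool the tokens of every other image, then sample a subset of them.
        others = np.concatenate([tokens[j] for j in range(B) if j != i], axis=0)
        idx = rng.choice(others.shape[0], size=n_sampled, replace=False)
        paired = np.concatenate([tokens[i], others[idx]], axis=0)
        q = tokens[i] @ w_q   # queries come from the image itself
        k = paired @ w_k      # keys and values also see the sampled tokens
        v = paired @ w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out[i] = attn @ v
    return out


rng = np.random.default_rng(0)
B, N, C = 3, 16, 8
tokens = rng.normal(size=(B, N, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) for _ in range(3))
out = consistent_self_attention(tokens, w_q, w_k, w_v, sample_rate=0.5, rng=rng)
print(out.shape)  # (3, 16, 8)
```

Because the sketch only changes which tokens feed the keys and values, it can reuse the projection weights of an existing attention layer, which is consistent with the article's description of the method as training-free.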

Pipeline for generating consistent transition videos

Foundation: the Semantic Motion Predictor (see the next figure), which is specifically designed for long-range video generation.

Input: conditional images and story text.

Output: videos that show smooth transitions and a consistent subject identity throughout the frames.

Video generation pipeline (source: paper)

It has 4 main steps:

  1. Encode conditional images: the model encodes the given images into a semantic space. It helps the model to understand the visual arrangement and elements within the images.
  2. Predict transition embeddings: it predicts the character’s significant movements from one frame to the next.
  3. Decode video with control signals: the predicted transition embeddings are then fed into a video generation model.
  4. Generate each frame: guided by the control signals from the transition embeddings, the model generates each frame so that character motion transitions smoothly and subjects remain consistent throughout the video.
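
The data flow of steps 1 to 3 can be illustrated with a small sketch. In StoryDiffusion the per-frame embeddings come from a trained Semantic Motion Predictor; the code below swaps in plain linear interpolation between the start and end embeddings as a naive stand-in, purely to show what "one embedding per frame" looks like. The function name and the two-dimensional toy embeddings are hypothetical.

```python
from typing import List


def interpolate_embeddings(start: List[float], end: List[float], num_frames: int) -> List[List[float]]:
    """Return num_frames embeddings moving linearly from start to end (inclusive).

    This is a naive stand-in for a learned predictor, not the trained module.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    if len(start) != len(end):
        raise ValueError("start and end embeddings must have the same length")
    frames = []
    for t in range(num_frames):
        alpha = t / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([(1 - alpha) * s + alpha * e for s, e in zip(start, end)])
    return frames


start_emb = [0.0, 1.0]  # toy embedding of the first conditional image
end_emb = [1.0, 0.0]    # toy embedding of the last conditional image
frames = interpolate_embeddings(start_emb, end_emb, num_frames=5)
for frame in frames:
    print(frame)
```

In the real pipeline, each of these per-frame embeddings would act as a control signal for the video generation model.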

For video generation, the method was implemented on the pre-trained Stable Diffusion 1.5 model. It also incorporates a Semantic Motion Predictor trained on the WebVid-10M dataset to create transition videos with subject consistency.

Evaluation results

(I) Image generation. StoryDiffusion was compared with two recent ID-preservation methods, IP-Adapter and PhotoMaker, on the consistency of generated images. The comparison covered text controllability, consistency of facial features, and consistency of clothing.

The qualitative results are shown in the next figure. StoryDiffusion generates highly consistent images, while the alternative techniques such as IP-Adapter and PhotoMaker may produce images with inconsistent attire or diminished text controllability.

Comparison of consistent image generation with recent methods (source: paper)

For the quantitative comparison, StoryDiffusion, IP-Adapter, and PhotoMaker were evaluated on text-image similarity and character similarity. As the table below shows, StoryDiffusion scores highest on both metrics, with a narrow margin over PhotoMaker.

| Metric | IP-Adapter | PhotoMaker | StoryDiffusion |
|---|---|---|---|
| Text-Image Similarity | 0.6129 | 0.6541 | 0.6586 |
| Character Similarity | 0.8802 | 0.8924 | 0.8950 |
Quantitative comparisons of consistent image generation (source: paper)
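
The article does not spell out how the two scores are computed. Metrics of this kind are commonly cosine similarities between embeddings from a vision-language model such as CLIP, so the sketch below assumes that setup. The helper names and the toy two-dimensional embeddings are hypothetical; a real evaluation would obtain the embeddings from an actual encoder.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def text_image_similarity(text_embs: List[Sequence[float]], image_embs: List[Sequence[float]]) -> float:
    """Mean cosine similarity between each prompt and the image generated from it."""
    scores = [cosine(t, i) for t, i in zip(text_embs, image_embs)]
    return sum(scores) / len(scores)


def character_similarity(image_embs: List[Sequence[float]]) -> float:
    """Mean cosine similarity over all pairs of images in one story."""
    scores = [cosine(a, b) for a, b in combinations(image_embs, 2)]
    return sum(scores) / len(scores)


# Toy embeddings: the first two images match, the third differs.
image_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(text_image_similarity(text_embs, image_embs))
print(character_similarity(image_embs))
```

Under this assumption, a higher character similarity means the images in one story look more alike, and a higher text-image similarity means each image follows its prompt more closely.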

(II) Video generation. In transition video generation, the performance of StoryDiffusion was compared with two leading techniques, SparseCtrl and SEINE. StoryDiffusion demonstrated superior results, successfully generating videos with very smooth motion and without any distorted frames in between.

Transition video generated by StoryDiffusion (source: project page)

(III) User study. In the user study, 30 participants answered 50 questions designed to evaluate the model’s performance. The following charts summarize the findings: participants preferred StoryDiffusion over existing methods in both image and video generation tasks.

User Study on subject-consistent image generation and transition video generation (data source: paper)


Conclusion

StoryDiffusion is a new AI tool for bringing stories to life. It allows you to create stories through a sequence of consistent images, all without the need for additional training.

By employing the Consistent Self-Attention method, it produces a smooth series of images that maintain the continuity of characters and their attire. For video generation, it uses the Semantic Motion Predictor to smoothly integrate these images into a video.

Further research could address some limitations of the model. It can occasionally produce slight inconsistencies in image details like clothing. Additionally, the current implementation is not yet optimized for generating very long videos.
