DreamPose: fashion image-to-video generation via Stable Diffusion

DreamPose is a new diffusion-based method that generates fashion videos from still images. Given an image and a corresponding sequence of body poses, the model produces realistic video sequences that capture both human and fabric motion.

Fashion photography is widely used online, but it has limitations in conveying subtle aspects of a piece of clothing, such as its drape and movement when worn.

Fashion videos, on the other hand, are highly informative and can help consumers make better decisions, but they are not widely available.

This study has been conducted by a team from the University of Washington, UC Berkeley, Google Research, and NVIDIA.

Given an image of a person and a sequence of body poses, DreamPose synthesizes a photorealistic video.

To overcome these limitations, the team developed a pipeline using the pre-trained Stable Diffusion model. As the model was originally designed for text-to-image synthesis, the team had to modify its architecture and customize it to generate animated fashion videos from still images.
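Conceptually, generation can proceed frame by frame: every output frame is denoised in latent space while being conditioned on the same subject image and on that frame's target pose, then decoded back to pixels. The sketch below illustrates this flow under those assumptions; all names (encode_image, denoise, decode_latents, latent_shape) are hypothetical placeholders rather than DreamPose's actual API.

```python
import torch

# Hypothetical high-level inference loop for an image- and pose-conditioned
# latent diffusion model. The method names below are illustrative only.
@torch.no_grad()
def generate_fashion_video(model, subject_image, pose_sequence, steps=50):
    image_cond = model.encode_image(subject_image)    # image conditioning signal
    frames = []
    for pose in pose_sequence:
        latents = torch.randn(model.latent_shape)     # start from Gaussian noise
        for t in reversed(range(steps)):
            # Each denoising step sees the subject image and the target pose
            # for this particular frame.
            latents = model.denoise(latents, t, image_cond, pose)
        frames.append(model.decode_latents(latents))  # VAE decoder -> RGB frame
    return torch.stack(frames)                        # (num_frames, C, H, W)
```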

The model was fine-tuned on a collection of fashion videos from the UBC Fashion dataset, and the method was evaluated on various clothing styles and body poses.

The results of the experiment show that DreamPose achieves state-of-the-art performance in fashion video animation, outperforming other existing methods, including Motion Representations for Articulated Animation (MRAA) and Thin-Plate Spline Motion Model (TPSMM).

The DreamPose architecture is derived from the Stable Diffusion model, which was modified to enable image and pose conditioning.

To achieve this, the CLIP text encoder was replaced with a dual CLIP-VAE (Variational Autoencoder) image encoder, and an adapter module was added.

DreamPose architecture

The adapter integrates the pre-trained CLIP and VAE image encoders and enables them to work together.
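The exact adapter design is not spelled out here, but a minimal sketch of such a module could look as follows; the class name, layer choices, and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CLIPVAEAdapter(nn.Module):
    """Illustrative adapter that blends CLIP image embeddings with VAE latent
    features and projects the result to the dimension expected by the
    denoising U-Net's cross-attention layers (dimensions are assumptions)."""

    def __init__(self, clip_dim=1024, vae_dim=4, cross_attn_dim=768):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, cross_attn_dim)
        self.vae_proj = nn.Linear(vae_dim, cross_attn_dim)
        self.mix = nn.Linear(2 * cross_attn_dim, cross_attn_dim)

    def forward(self, clip_tokens, vae_tokens):
        # clip_tokens: (batch, n_tokens, clip_dim) from the CLIP image encoder
        # vae_tokens:  (batch, n_tokens, vae_dim) from flattened VAE features
        clip_feat = self.clip_proj(clip_tokens)
        vae_feat = self.vae_proj(vae_tokens)
        # Concatenate both signals and map them to the conditioning shape
        # consumed by the U-Net's cross-attention modules.
        return self.mix(torch.cat([clip_feat, vae_feat], dim=-1))
```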

Two-phase fine-tuning scheme

The two-phase fine-tuning process:

  1. The modified Stable Diffusion model (the denoising UNet and adapter module) is fine-tuned on the entire dataset to improve the model’s generalization capabilities. The adapter blends the signals coming from the CLIP and VAE encoders and transforms the output into the shape expected by the denoising UNet’s cross-attention modules.
  2. The model (the UNet, adapter, and VAE decoder) is fine-tuned again, this time using a single subject image. This step tailors the model to the specific characteristics of the input image, allowing for more precise and personalized animations, as sketched after this list.
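A simplified sketch of this two-phase schedule is shown below, assuming a standard diffusion training loop; the denoising_loss helper, data loaders, and hyperparameters are hypothetical placeholders, not the authors' code.

```python
import itertools
import torch

def finetune_dreampose(unet, adapter, vae_decoder, full_loader, subject_loader,
                       denoising_loss, lr=1e-5, subject_steps=500):
    """Illustrative two-phase fine-tuning loop (placeholder names)."""
    # Phase 1: fine-tune the denoising UNet and the adapter on the whole
    # dataset so the model generalizes across subjects and garments.
    opt1 = torch.optim.AdamW(
        itertools.chain(unet.parameters(), adapter.parameters()), lr=lr)
    for batch in full_loader:
        loss = denoising_loss(unet, adapter, batch)   # standard diffusion loss
        opt1.zero_grad()
        loss.backward()
        opt1.step()

    # Phase 2: additionally fine-tune the VAE decoder on a single subject image
    # so the synthesized frames keep that subject's sharp, detailed appearance.
    opt2 = torch.optim.AdamW(
        itertools.chain(unet.parameters(), adapter.parameters(),
                        vae_decoder.parameters()), lr=lr)
    for step, batch in enumerate(subject_loader):
        if step >= subject_steps:
            break
        # The decoder is passed so its reconstruction contributes to the loss.
        loss = denoising_loss(unet, adapter, batch, decoder=vae_decoder)
        opt2.zero_grad()
        loss.backward()
        opt2.step()
```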

Tests and results

The experiments were conducted on two NVIDIA A100 GPUs with a resolution of 512×512 pixels. The model was trained and evaluated on the UBC Fashion dataset, with 339 training and 100 test videos. Each video had a frame rate of 30 frames per second and was approximately 12 seconds in length.

The research team found that fine-tuning the VAE decoder is essential for obtaining a more realistic appearance in the synthesized output frames, with sharp and detailed features.

The results showed that the DreamPose model is capable of generating accurate, high-quality videos that remain faithful to the input frame, preserving garment folds, fine-grained patterns, and face identity.

Conclusion

The novel approach transforms fashion photographs into realistic animated videos using a single image and a corresponding sequence of human poses.

The revised model architecture, which substitutes the CLIP encoder with a dual CLIP-VAE image encoder and an adapter module, offers enhanced accuracy in managing the video generation process.

The advantage of using a pre-existing Stable Diffusion model is that it has already been trained on a large dataset of natural images, and has learned to effectively model the distribution of such images.

By using a pre-trained model and fine-tuning it for a specific task, the team simplified the process of animation and saved time and resources that would otherwise be necessary to train a new model from scratch.

The generated outputs exhibited some shortcomings when tested, such as limbs vanishing into the fabric, distorted dress features, and misaligned poses when the target pose was facing backwards.

In order to address these challenges, future research could focus on enhancing the accuracy of body pose estimation, while also considering expanding the dataset.
