Magic Insert is a new method proposed by Google that lets you drag and drop anything from one image into another, even when their styles are completely different. The inserted subject looks like it truly belongs in the new scene.
You can find the paper, demo and dataset on the project page.
Magic Insert stands out from previous tools
Magic Insert uses a unique combination of tools to maintain stylistic coherence between the inserted subject and the target image. Instead of editing the background directly, the method first generates the subject in a style that matches the background, and only then inserts it into the scene.
Overall, Magic Insert offers a significant improvement over traditional approaches like inpainting for seamlessly transferring subjects between images with different artistic styles. The authors also created a new dataset (SubjectPlop) to help evaluate and improve this line of work.
Method
The Magic Insert method can be broken down into four steps:
1. Fine-tune a pretrained diffusion model on the subject, simultaneously training adapter weights (LoRA deltas) and two text embeddings. The embeddings capture the identity and the detailed features of the subject (see the first sketch after this list).
2. The personalized model from step 1 is infused with a representation of the target style. CLIP (Contrastive Language-Image Pretraining) produces a style embedding from the target image, which is injected into the UNet blocks of the personalized model via IP-Adapter (Image Prompt Adapter). This way, the generated subject image adheres to the style of the target image (see the second sketch after this list).
3. Place (copy-paste) a segmented version of the generated style-aware subject image into the target image.
4. The subject insertion model generates contextual elements such as shadows and reflections, ensuring the subject looks cohesive and realistic within the target scene. The specific technique used here is called Bootstrapped Domain Adaptation.
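To make step 1 concrete, here is a minimal PyTorch sketch of the two trainable ingredients: a pair of learnable text embeddings for the subject, and a LoRA delta wrapped around a frozen linear layer. The class names and hyperparameters (rank, scale, embedding size) are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a LoRA-wrapped linear layer and two learnable
# subject embeddings, as used conceptually in step 1. Names, rank, and
# dimensions are assumptions, not taken from the paper.

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank delta."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the LoRA delta is trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base output plus the low-rank update: W x + scale * B (A x)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

class SubjectEmbeddings(nn.Module):
    """Two trainable text embeddings: subject identity and fine detail."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.identity = nn.Parameter(torch.randn(embed_dim) * 0.01)
        self.detail = nn.Parameter(torch.randn(embed_dim) * 0.01)

# Example: wrap one attention projection of a (stand-in) UNet block.
proj = LoRALinear(nn.Linear(768, 768))
tokens = SubjectEmbeddings()
trainable = [p for p in list(proj.parameters()) + list(tokens.parameters())
             if p.requires_grad]
print(sum(p.numel() for p in trainable))  # only LoRA + embedding params train
```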
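And a minimal sketch of step 2, assuming the Hugging Face transformers CLIP API: the target image is encoded into a style embedding, and a small projection head (in the spirit of IP-Adapter) turns it into extra context tokens for the UNet's cross-attention layers. The adapter architecture, token count, and file names here are placeholders.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Sketch of step 2: CLIP style embedding -> extra cross-attention tokens.
# The StyleAdapter head is an IP-Adapter-style placeholder; the paper's
# exact adapter weights and architecture may differ.

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

target = Image.open("target_scene.png").convert("RGB")  # hypothetical file
inputs = processor(images=target, return_tensors="pt")
with torch.no_grad():
    style_embed = clip.get_image_features(**inputs)  # shape (1, 768)

class StyleAdapter(nn.Module):
    """Projects one CLIP image embedding into N style tokens that get
    appended as extra key/value entries in each cross-attention block."""
    def __init__(self, clip_dim: int = 768, ctx_dim: int = 768,
                 num_tokens: int = 4):
        super().__init__()
        self.num_tokens, self.ctx_dim = num_tokens, ctx_dim
        self.proj = nn.Linear(clip_dim, num_tokens * ctx_dim)

    def forward(self, embed: torch.Tensor) -> torch.Tensor:
        return self.proj(embed).view(-1, self.num_tokens, self.ctx_dim)

style_tokens = StyleAdapter()(style_embed)  # (1, 4, 768), fed to the UNet
```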
Bootstrap Domain Adaptation for Subject Insertion
Current subject insertion techniques, like diffusion-based inpainting, focus on filling in the image area surrounding the inserted object. These methods face several challenges: difficulty generating suitable content for smooth areas, unintentional removal of objects behind the inserted subject, and trouble producing complete figures. Magic Insert sidesteps these problems by copying and pasting the stylized subject into the target image (as sketched below) and then generating contextual details such as shadows and reflections.
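The copy-paste itself is plain alpha compositing. A minimal sketch with Pillow, assuming the stylized subject has already been segmented into an RGBA image whose alpha channel serves as the mask (file names and the paste position are placeholders):

```python
from PIL import Image

# Naive composite of a segmented, stylized subject onto the target scene.
# File names and coordinates are hypothetical.
background = Image.open("target_scene.png").convert("RGBA")
subject = Image.open("stylized_subject.png").convert("RGBA")  # transparent bg

position = (120, 340)  # top-left corner where the subject is placed
background.paste(subject, position, mask=subject)  # alpha channel as mask
background.save("naive_composite.png")  # still lacks shadows/reflections
```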
Bootstrapped Domain Adaptation was integrated into Magic Insert to adapt existing subject insertion models, initially trained on real-world images, to work effectively on stylized images. The technique lets the model shift its effective domain by leveraging a subset of its own outputs. The training process has four main steps:
- Subject and shadow removal: a subject removal/insertion model, trained on real data (ObjectDrop, in this case), removes subjects and their shadows from target-domain images.
- Filtering outputs: the resulting images are filtered to exclude flawed ones, either with human feedback or automatically with a quality evaluation module.
- Subject insertion: the removed subjects are pasted back into the cleaned backgrounds, yielding paired training examples.
- Retraining: the model is retrained on its own filtered outputs.
This iterative process teaches the model to seamlessly integrate a stylized subject into a target background image; the loop below sketches the control flow.
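A high-level Python sketch of that loop, under the assumption that removal, filtering, and fine-tuning are available as helpers. Every name here (`remove_subject`, `passes_quality_check`, `naive_paste`, `finetune`) is a hypothetical stand-in for illustration, not the paper's code:

```python
def bootstrap_domain_adaptation(model, stylized_images, num_rounds=3):
    """Control-flow sketch of bootstrapped domain adaptation.
    `model` starts out trained on real photos (ObjectDrop-style)."""
    for _ in range(num_rounds):
        pairs = []
        for img in stylized_images:
            # 1. Subject and shadow removal on a stylized image.
            clean_bg, subject = model.remove_subject(img)
            # 2. Filter out flawed removals (human or automatic check).
            if not passes_quality_check(clean_bg):
                continue
            # 3. Naive re-insertion gives the training input; the original
            #    stylized image is the ground-truth target.
            pairs.append((naive_paste(subject, clean_bg), img))
        # 4. Retrain the model on its own filtered outputs, shifting its
        #    effective domain toward stylized imagery.
        model = finetune(model, pairs)
    return model
```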
Using the pre-trained subject insertion module without Bootstrapped Domain Adaptation leads to missing shadows and reflections, as well as added distortions and artifacts. The picture below shows that bootstrapped domain adaptation yields noticeably better outputs.
Evaluation results
The next picture showcases some examples of the Magic Insert method applied to various subjects and backgrounds, using a wide range of styles.
Comparison with baseline methods
Magic Insert’s performance was compared with baseline methods such as StyleAlign and InstantStyle, combined with techniques like ControlNet and a VLM (ChatGPT-4) to control the style and content of image generation. Both qualitative and quantitative comparisons show the new model clearly outperforming these approaches (see the next picture).
Additionally, a user study with 60 participants compared the full method (with ControlNet) against two baselines: StyleAlign ControlNet and InstantStyle ControlNet. Participants ranked the methods on subject identity preservation, style fidelity with the background image, and realistic subject insertion, and showed a clear preference for Magic Insert’s outputs over the baselines (see the table below).
| Method | User Preference |
| --- | --- |
| Magic Insert over StyleAlign ControlNet | 85% |
| Magic Insert over InstantStyle ControlNet | 80% |
The advantages of Magic Insert
Magic Insert offers several compelling advantages that make it a valuable tool for both professional and amateur image editors:
- High realism: the technology ensures that inserted subjects look natural and blend seamlessly with the target image.
- Improved efficiency: by using LoRA for fine-tuning, the model trains only a small fraction of the parameters a full fine-tune would require (see the quick parameter count after this list).
- User-friendly interface: the intuitive drag-and-drop functionality streamlines the image editing process, enabling users of all experience levels to achieve professional-grade results with minimal effort.
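As a rough back-of-the-envelope illustration (the dimensions are typical for SD-style UNet attention blocks, not figures from the paper), compare a full fine-tune of one attention projection with its LoRA counterpart:

```python
# Rough parameter count for a single attention projection matrix.
# d and rank are assumed, illustrative values.
d = 1280             # hidden size of a UNet attention block
rank = 4             # LoRA rank
full = d * d         # trainable params when fine-tuning the full weight
lora = 2 * d * rank  # LoRA factors: A (rank x d) + B (d x rank)
print(full, lora, f"{100 * lora / full:.2f}%")  # 1638400 10240 0.62%
```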
Conclusion
Magic Insert is a powerful new tool for image editing, offering highly realistic, user-friendly drag-and-drop functionality. By combining LoRA fine-tuning, CLIP style representations, and Bootstrapped Domain Adaptation, the model ensures that the inserted subject looks natural and blends seamlessly with the new background.
Read more:
- Paper on arXiv: “Magic Insert: Style-Aware Drag-and-Drop”
- Project page