Meta AI has developed ImageBind, a cutting-edge AI tool capable of integrating data from six diverse modalities: text, image, audio, depth, thermal, and Inertial Measurement Units (IMU).
The remarkable thing about ImageBind is that it can combine these different kinds of data — images, text, audio, depth, thermal, and IMU — even when some pairings are missing or unrelated in the training data. For example, it can combine an image of a car with the sound of a horn to find a video of the car honking.
With this approach, image-paired data alone is sufficient to bind all the modalities together; the model does not need to be trained on every combination of paired modalities.
The new model outperformed supervised models at recognizing categories of data it had never been trained on (zero-shot) or had seen only a few examples of (few-shot), across different modalities.

ImageBind opens up numerous possibilities for multimodal applications including retrieving information across different modalities, manipulating and combining modalities, detecting patterns across modalities, and generating new content based on multimodal input.
Method
The goal was to learn a single embedding space in which all kinds of data can be represented, using images as the natural bridge between modalities.
Each modality's embeddings were aligned to image embeddings: text to images using data from the web, and IMU to video using egocentric footage captured by cameras equipped with IMU sensors.
Remarkably, the learned embedding space could automatically link pairs of data without any specific training data for those pairs.
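The alignment described above is trained with an InfoNCE-style contrastive loss between image embeddings and the embeddings of each other modality. Below is a minimal NumPy sketch of that idea; the batch layout and the temperature value are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np

def infonce_loss(img_emb, mod_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning a batch of modality embeddings
    (e.g., audio) to the corresponding image embeddings.

    img_emb, mod_emb: arrays of shape (batch, dim) with L2-normalized rows,
    where row i of each array comes from the same paired example.
    """
    # Cosine-similarity logits between every image/modality pair.
    logits = img_emb @ mod_emb.T / temperature          # (batch, batch)
    # Matching pairs sit on the diagonal; treat them as the "correct class".
    targets = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Average both directions: image -> modality and modality -> image.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The loss is lowest when each image embedding is closest to the embedding of its own paired modality sample and far from the other samples in the batch.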

Multimodal learning has a great potential for real-world applications. However, the lack of extensive multimodal data containing all modalities together is a significant challenge when it comes to learning a comprehensive joint space encompassing all modalities.
By building on powerful pretrained vision-language models such as CLIP, ImageBind overcomes the need for large amounts of paired data across every modality.
Furthermore, ImageBind expanded the “zero-shot” capabilities of these models, enabling them to perform new tasks without any specific training for those tasks.
Implementation & training details
ImageBind is a simple and flexible approach that can be applied to a wide range of multimodal learning applications. It enables researchers and practitioners to experiment with different implementation strategies based on their specific requirements and resources.
All modalities were encoded using Transformer architectures. A Vision Transformer (ViT) was used to encode the image, video, audio, thermal, and depth modalities, with a separate encoder per modality. For the text encoder, the team followed the design from CLIP.
The main idea behind ImageBind is to align the embeddings of all modalities to image embeddings.
To investigate the effect of image representations, the size of the image encoder was varied while the sizes of the other modality encoders were kept fixed.
The model was trained on the following datasets:
- Large-scale image-text paired data: A large amount of paired image and text data collected from the internet served as the foundation for training and learning in the system.
- Self-supervised data: In addition to the paired data, the system also used naturally paired “self-supervised” data. For instance, the model learned autonomously to associate an image of a car with the sound of its engine or an image of a person with their body temperature.
- Four new modalities: The system extended its capabilities by incorporating data from four additional modalities: audio, depth, thermal, and IMU.
The experiments were run on Nvidia V100 (32 GB) and A100 (40 GB) GPUs.
Tests & evaluation results
The model was evaluated on tasks such as recognizing and retrieving different types of information without any task-specific examples or instructions — for example, classifying an audio clip into one of 527 categories, or retrieving an image that matches a text query.
It performed very well on these tasks, often surpassing previous models that were trained specifically for each type of information (see picture below).
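Zero-shot classification in a shared embedding space works by embedding the text of each candidate class and picking the class whose text embedding is closest to the query embedding. A hypothetical sketch (the prompts and class names are made up for illustration):

```python
import numpy as np

def zero_shot_classify(query_emb, class_embs, class_names):
    """Zero-shot classification by nearest text embedding.

    query_emb: (dim,) L2-normalized embedding of, e.g., an audio clip.
    class_embs: (num_classes, dim) L2-normalized embeddings of text
    prompts such as "the sound of a dog barking" (illustrative).
    """
    sims = class_embs @ query_emb            # cosine similarity per class
    return class_names[int(np.argmax(sims))]
```

No audio-specific classifier is trained: the 527-way audio classification task reduces to 527 text embeddings and a nearest-neighbor lookup.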

Applications
The pictures below show ImageBind’s capabilities to combine information across different modalities.
Given an audio query such as "crackle of a fire", the model can retrieve images and videos that contain fire, as well as text captions describing fire scenes. It can also generate an image of "a bird at the beach" from an image of a bird and the sound of waves.

Embeddings from an image of fruits and the sound of birds retrieves images of birds surrounded by fruits, demonstrating the model’s ability to combine different modalities.
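The fruits-plus-birdsong example relies on simple embedding arithmetic: adding two normalized embeddings, renormalizing, and retrieving the nearest candidates in the shared space. A minimal sketch of that idea, assuming all embeddings are already L2-normalized:

```python
import numpy as np

def combine_and_retrieve(emb_a, emb_b, candidates, k=3):
    """Combine two modality embeddings (e.g., image + audio) by addition,
    renormalize, and return the indices of the k nearest candidate
    embeddings. All rows are assumed L2-normalized; purely illustrative."""
    combined = emb_a + emb_b
    combined /= np.linalg.norm(combined)
    sims = candidates @ combined             # cosine similarity per candidate
    return np.argsort(-sims)[:k]             # indices of the top-k matches
```

The summed query sits "between" its two inputs in the embedding space, so the top matches are candidates that relate to both — e.g., birds surrounded by fruits.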

Conclusion
ImageBind is a breakthrough in multimodal learning that can handle text, audio, depth, thermal and more using only image-paired data.
The research showed that simply aligning images with each of the other modalities leads to an emergent alignment between those other modalities themselves.
It opens up new possibilities for creating and understanding multimodal content with minimal supervision and resources.
Learn more:
- Paper: “ImageBind: One Embedding Space To Bind Them All” (on arXiv)
- Blog: “ImageBind: Holistic AI learning across six modalities” (Meta AI)
- Demo: “One embedding to bind them all”
