Meta releases ImageBind, a multisensory AI model that integrates six types of data

Meta AI has developed ImageBind, a cutting-edge AI tool capable of integrating data from six diverse modalities: text, image, audio, depth, thermal, and Inertial Measurement Units (IMU).

The remarkable thing about ImageBind is that it can combine these different kinds of data even when some of them are missing or were never observed together. For example, it can pair an image of a car with the sound of a horn to produce a video of the car honking.

With this approach, image-paired data alone is sufficient to bind all the modalities together; the model does not need to be trained on every combination of paired modalities.

Across modalities, the new model outperformed supervised models at recognizing categories it had never been trained on (zero-shot) or had seen only a few examples of (few-shot).

ImageBind integrates data from six diverse modalities: text, image, audio, depth, thermal, and IMU

ImageBind opens up numerous possibilities for multimodal applications including retrieving information across different modalities, manipulating and combining modalities, detecting patterns across modalities, and generating new content based on multimodal input.

Method

The goal was to learn a single embedding space in which all modalities can be represented, using images as the natural bridge between them.

Each modality’s embedding was aligned to an image embedding: for example, text to images using data from the web, and IMU to video captured by egocentric cameras equipped with IMU sensors.

Remarkably, the learned embedding space could automatically link pairs of data without any specific training data for those pairs.
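This image-anchored alignment is the kind of setup typically trained with a contrastive (InfoNCE) objective that pulls each modality's embedding toward the embedding of its paired image. Below is a minimal PyTorch sketch of such a loss; the batch size, embedding dimension, and temperature are illustrative assumptions, not the exact training configuration used for ImageBind.

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_emb: torch.Tensor, other_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: pull each modality embedding toward the
    embedding of its paired image, and vice versa."""
    # L2-normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)

    # Pairwise similarities between all images and all other-modality samples in the batch
    logits = image_emb @ other_emb.t() / temperature

    # The matching pair for each sample sits on the diagonal
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Contrast in both directions: image -> other and other -> image
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: a batch of 8 paired (image, audio) embeddings of dimension 512
loss = infonce_loss(torch.randn(8, 512), torch.randn(8, 512))
```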

Different kinds of data occur naturally paired in different sources: images+text and video+audio on the web, and depth or thermal data captured alongside images. ImageBind connects all of these kinds of data in a shared space.

Multimodal learning has great potential for real-world applications. However, the lack of extensive datasets containing all modalities together is a significant challenge for learning a comprehensive joint embedding space.

By building on powerful vision-language models such as CLIP, ImageBind sidesteps the need for large amounts of paired data across every modality.

Furthermore, ImageBind expanded the “zero-shot” capabilities of these models, enabling them to perform new tasks without any specific training for those tasks.

Implementation & training details

ImageBind is a simple and flexible approach that can be applied to a wide range of multimodal learning applications. It enables researchers and practitioners to experiment with different implementation strategies based on their specific requirements and resources.

All modalities were encoded with a Transformer architecture: a Vision Transformer (ViT) was used for the image, video, audio, thermal, and depth modalities, with a separate encoder per modality. The text encoder followed the design from CLIP.
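As a rough illustration of this design, the sketch below wires up separate Transformer-style encoders per modality and projects each output into a single shared embedding dimension. The module names, token dimensions, and depths are hypothetical placeholders, not the actual ImageBind architecture or hyperparameters.

```python
import torch
import torch.nn as nn

SHARED_DIM = 512  # dimension of the joint embedding space (illustrative)

class ModalityEncoder(nn.Module):
    """A tiny stand-in for a per-modality Transformer encoder (e.g. a ViT):
    tokens -> Transformer -> mean pool -> projection into the shared space."""
    def __init__(self, token_dim: int, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(token_dim, SHARED_DIM)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (batch, seq, token_dim)
        features = self.backbone(tokens).mean(dim=1)           # pool over tokens
        return self.proj(features)                             # (batch, SHARED_DIM)

# One encoder per modality, all projecting into the same shared space
encoders = nn.ModuleDict({
    "image": ModalityEncoder(token_dim=768),
    "audio": ModalityEncoder(token_dim=256),
    "depth": ModalityEncoder(token_dim=768),
    "imu":   ModalityEncoder(token_dim=64),
})

audio_emb = encoders["audio"](torch.randn(2, 100, 256))  # -> (2, 512)
```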

The main idea behind ImageBind is to align the embeddings of all modalities to image embeddings

To investigate the effect of the image representation, the size of the image encoder was varied while the sizes of the other modality encoders were kept fixed.

The model was trained on the following datasets:

  1. Large-scale image-text paired data: a large amount of paired image and text data collected from the web served as the foundation for training.
  2. Naturally paired "self-supervised" data: beyond web image-text pairs, the system used data that occurs naturally paired, learning for instance to associate an image of a car with the sound of its engine, or an image of a person with a thermal reading of their body temperature.
  3. Four new modalities: the system extended its capabilities by incorporating data from four additional modalities: audio, depth, thermal, and IMU.

The experiments were run on Nvidia V100 (32 GB) and A100 (40 GB) GPUs.

Tests & evaluation results

The model was tested on a range of zero-shot recognition and retrieval tasks, i.e., without any task-specific training examples: for instance, classifying an audio clip into one of 527 categories, or retrieving the image that matches a text query.
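In a joint embedding space, zero-shot classification reduces to nearest-neighbour matching between the input's embedding and text embeddings of the class names. The sketch below illustrates that recipe; the embeddings are random placeholders standing in for real encoder outputs, and the class count of 527 simply mirrors the audio benchmark mentioned above.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(audio_emb: torch.Tensor, class_text_embs: torch.Tensor) -> int:
    """Pick the class whose text embedding is closest (by cosine similarity)
    to the audio embedding -- no audio-specific classifier is trained."""
    audio_emb = F.normalize(audio_emb, dim=-1)               # (dim,)
    class_text_embs = F.normalize(class_text_embs, dim=-1)   # (num_classes, dim)
    scores = class_text_embs @ audio_emb                     # cosine similarity per class
    return int(scores.argmax())

# Toy example: 527 hypothetical classes with 512-dim embeddings
predicted = zero_shot_classify(torch.randn(512), torch.randn(527, 512))
```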

It performed very well on these tasks, even better than previous models that were trained specifically for each modality (see picture below).

ImageBind outperformed specialist models in audio and depth, based on benchmarks

Applications

The pictures below show ImageBind’s capabilities to combine information across different modalities.

Given an audio query such as "crackle of a fire", the model can retrieve images and videos that contain fire, as well as text captions describing fire scenes. It can produce an image of "a bird at the beach" by combining an image of a bird with the sound of waves.

ImageBind’s novel multimodal capabilities: cross-modal retrieval, combination of different modalities, audio to image generation

Combining the embeddings of an image of fruits and the sound of birds retrieves images of birds surrounded by fruits, demonstrating the model's ability to compose different modalities.

An example of embedding space arithmetic where a combination of image+audio embeddings was used for image retrieval
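This kind of embedding-space arithmetic can be sketched as adding normalized embeddings and running nearest-neighbour retrieval over a gallery. The code below illustrates the idea with random placeholder vectors rather than real ImageBind outputs; the simple additive combination is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def retrieve_by_combination(image_emb, audio_emb, gallery_embs, top_k=5):
    """Combine an image embedding and an audio embedding by addition,
    then retrieve the closest gallery items by cosine similarity."""
    query = F.normalize(image_emb, dim=-1) + F.normalize(audio_emb, dim=-1)
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery_embs, dim=-1)    # (num_images, dim)
    scores = gallery @ query                       # similarity to the combined query
    return scores.topk(top_k).indices              # indices of the best-matching images

# Toy example: a "fruits" image embedding + "birds" audio embedding
# queried against 1,000 random gallery embeddings
hits = retrieve_by_combination(torch.randn(512), torch.randn(512), torch.randn(1000, 512))
```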

Conclusion

ImageBind is a breakthrough in multimodal learning that can handle text, audio, depth, thermal and more using only image-paired data.

The research showed that aligning images with each of the other modalities is enough to produce an emergent alignment between those other modalities themselves.

It opens up new possibilities for creating and understanding multimodal content with minimal supervision and resources.
