Multimodal foundation models: the future of AI assistants


Researchers from Microsoft present a comprehensive survey of multimodal foundation models (MFMs), focusing on the transition from specialist models to general-purpose assistants.

Multimodal Foundation Models (MFMs) are AI models that can handle and understand different types of data, such as text, images, and audio.

Specialist MFMs are trained to perform specific tasks, such as image classification or text translation.

General-purpose MFMs, on the other hand, are trained to perform a wide variety of tasks, such as answering questions, generating text, and translating languages.

Applications of MFMs

Here are some specific examples of how MFMs could be used in the future:

  • Search engines that can understand the meaning of images and videos, allowing users to search for content using both text and images.
  • Creative tools that allow people to create and share visual content in new and innovative ways.
  • Virtual assistants that can understand and respond to multimodal input, such as text, images, and audio.
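The search-engine use case above boils down to retrieval in a shared text–image embedding space. Here is a minimal sketch, assuming embeddings have already been produced by some dual-encoder model (the vectors below are made-up toy values, not real model outputs):

```python
import numpy as np

def cosine_sim(query, matrix):
    # Cosine similarity between one query vector and each row of a matrix.
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

# Hypothetical precomputed embeddings living in a shared text/image space
# (in practice produced by a CLIP-style dual encoder).
image_embeddings = np.array([
    [0.9, 0.1, 0.0],   # photo of a dog
    [0.0, 0.8, 0.2],   # photo of a beach
    [0.1, 0.1, 0.9],   # photo of a car
])
text_query = np.array([0.85, 0.15, 0.05])  # embedding of "a dog playing"

scores = cosine_sim(text_query, image_embeddings)
best = int(np.argmax(scores))
print(best)  # index of the best-matching image
```

The same ranking works in reverse (image query against text embeddings), which is what lets users search with either modality.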

Paper content

The survey is intended for researchers, students, and professionals who are interested in learning the basics and recent advances in multimodal foundation models.

It reviews the state-of-the-art models and applications in various domains and tasks and discusses the future directions and opportunities for improving MFMs.

The research paper covers five main topics:

1. Visual understanding. This chapter focuses on how to train MFMs to learn the fundamental visual features that are necessary for tasks such as object recognition and image classification.

Methods of learning general-purpose vision backbones. (Source: paper)
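One widely used recipe for learning such general-purpose vision backbones is contrastive image–text pretraining in the style of CLIP. The toy NumPy sketch below shows the symmetric cross-entropy objective on made-up embeddings; the temperature value and batch contents are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize both sides, then compute pairwise similarity logits.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    n = len(logits)
    labels = np.arange(n)
    # Matched image-text pairs sit on the diagonal; the loss penalizes
    # off-diagonal matches in both retrieval directions.
    loss_i = -np.log(softmax(logits, axis=1)[labels, labels]).mean()
    loss_t = -np.log(softmax(logits, axis=0)[labels, labels]).mean()
    return (loss_i + loss_t) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs
print(round(clip_style_loss(img, txt), 4))  # small loss for aligned pairs
```

Training a backbone this way yields features that transfer to downstream tasks like recognition and classification without task-specific labels.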

2. Text-to-image generation. This chapter shows how to train MFMs to generate high-fidelity visual content (images, videos, neural radiance fields, 3D point clouds, and more) from text descriptions.

An overview of topics on human alignment for generative foundation models. (Source: paper)
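A key ingredient in most modern text-to-image diffusion models is classifier-free guidance, which blends a text-conditioned and an unconditional noise prediction at each denoising step. A minimal sketch with stubbed predictions (the stub values and guidance scale are illustrative):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_text, guidance_scale=7.5):
    # Push the denoising direction toward the text-conditioned prediction;
    # larger scales trade diversity for prompt fidelity.
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)

# Stub noise predictions. In a real model these come from a denoiser
# (e.g. a U-Net) evaluated with and without the text prompt.
eps_uncond = np.zeros(4)
eps_text = np.full(4, 0.1)

eps = classifier_free_guidance(eps_uncond, eps_text)
print(eps)  # [0.75 0.75 0.75 0.75]
```

The guided prediction `eps` then replaces the raw prediction inside the sampler's update rule.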

3. Unified vision models inspired by large language models (LLMs). This chapter focuses on the unification of vision models (see the figure below). LLMs have proven very effective at learning complex relationships between words and phrases; by adapting these architectures to the multimodal domain, researchers hope to create MFMs that learn complex relationships across different data modalities.

Summary of the works connected to the unification of vision models.
(Source: paper)
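The basic trick behind this unification, popularized by Vision Transformers, is to turn an image into a token sequence that an LLM-style transformer can consume alongside text tokens. A minimal sketch (the stub text embeddings and sizes are illustrative):

```python
import numpy as np

def patchify(image, patch_size):
    # Split an HxWxC image into flattened, non-overlapping patches:
    # each patch becomes one "visual token" in the input sequence.
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

image = np.zeros((32, 32, 3))
image_tokens = patchify(image, patch_size=8)   # 16 patches, each of dim 192
text_tokens = np.zeros((5, 192))               # stub text token embeddings
sequence = np.concatenate([text_tokens, image_tokens])
print(sequence.shape)  # (21, 192)
```

Once both modalities share one sequence, a single transformer can attend across text and image tokens jointly.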

4. Large multimodal models. This chapter covers models that perform diverse multimodal tasks by following natural language instructions.

Multimodal GPT-4. (Source: paper)
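Instruction-following multimodal models typically accept a single turn that interleaves text and image parts. The sketch below uses a hypothetical, vendor-neutral message structure (the field names and `<image>` placeholder token are illustrative, not a specific API):

```python
def render_prompt(messages, image_token="<image>"):
    # Flatten multimodal messages into one string, replacing image parts
    # with a placeholder that the model's vision encoder later fills in.
    parts = []
    for msg in messages:
        for part in msg["content"]:
            if part["type"] == "text":
                parts.append(part["text"])
            elif part["type"] == "image":
                parts.append(image_token)
    return " ".join(parts)

messages = [
    {"role": "user", "content": [
        {"type": "image", "path": "chart.png"},
        {"type": "text", "text": "Summarize this chart in one sentence."},
    ]},
]
print(render_prompt(messages))  # "<image> Summarize this chart in one sentence."
```

The model sees the instruction and the image in one context, which is what lets it follow free-form natural language requests about visual input.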

5. Chaining multimodal tools with LLMs. This chapter presents methods for combining multimodal tools with LLMs to create even more powerful AI systems. For example, researchers are exploring ways to use vision tools to provide LLMs with visual context, which could improve their performance on tasks such as question answering and summarization.
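A minimal sketch of this chaining pattern, with both the LLM planner and the vision tool stubbed out (the tool name, planner logic, and strings are all illustrative; a real system would call actual models):

```python
def caption_image(path):
    # Stub vision tool; a real system would invoke an image-captioning model.
    return f"a photo described by the captioner ({path})"

TOOLS = {"caption_image": caption_image}

def stub_llm_plan(user_request):
    # Stub planner; a real LLM would emit this decision from a prompt
    # that lists the available tools.
    if "image" in user_request:
        return {"tool": "caption_image", "arg": "photo.jpg"}
    return {"tool": None, "answer": user_request}

def run(user_request):
    plan = stub_llm_plan(user_request)
    if plan["tool"]:
        # The tool's text output gives the (text-only) LLM visual context.
        observation = TOOLS[plan["tool"]](plan["arg"])
        return f"Based on the image: {observation}"
    return plan["answer"]

print(run("What is in this image?"))
```

The loop generalizes: the planner can chain several tool calls, feeding each observation back into the LLM's context before producing a final answer.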


To summarize: MFMs are AI models that can both process and generate different kinds of data, such as text, images, and audio.

They aim to understand human intent and perform a wide range of vision and vision-language tasks in real-world scenarios.

In addition to the potential benefits outlined above, MFMs could help address some of the challenges we currently face with AI. For example, by accepting speech, text, or images interchangeably, they could make AI more accessible to people with disabilities, and careful multimodal alignment may help mitigate some forms of AI bias.

Learn more:

Research paper: “Multimodal Foundation Models: From Specialists to General-Purpose Assistants” (on arXiv)
