MultiModal-GPT: a vision and language model that can dialogue with humans

MultiModal-GPT is a generative model that can engage in multi-round conversations with humans using both images and text. It can perform a variety of tasks, such as producing descriptive captions, identifying and counting specific objects in images, and answering general questions from users.

The model was introduced by a team of researchers from Shanghai AI Laboratory, the University of Hong Kong, and the School of Electrical and Information Engineering at Tianjin University.

MultiModal-GPT was built upon OpenFlamingo, an open-source large multimodal model that can process both image and text inputs and generate text.

Although OpenFlamingo is a powerful multimodal model, it cannot conduct complex conversations that involve both text and images. To address this limitation, the researchers enhanced its capabilities by training it on large instruction-following datasets that combine image and text data.

The team also added a Low-Rank Adapter (LoRA), a method for fine-tuning large language models with fewer trainable parameters and less memory.
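
In essence, LoRA keeps the pretrained weight matrix frozen and learns a small low-rank update on top of it, so only a tiny fraction of the parameters is trained. Below is a minimal PyTorch sketch of a LoRA-augmented linear layer; the rank, scaling, and layer sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A maps d_in -> r, B maps r -> d_out; only these are trained.
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)            # the update starts at zero
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen projection + scaled low-rank correction.
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Wrapping a 4096x4096 projection trains ~131k parameters instead of ~16.8M.
proj = LoRALinear(nn.Linear(4096, 4096))
print(sum(p.numel() for p in proj.parameters() if p.requires_grad))
```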

The model

MultiModal-GPT’s overall framework is shown in the picture below.

The MultiModal-GPT architecture

The model includes:

  * a Vision Encoder that extracts visual features from the input image;
  * a Perceiver Resampler that resamples those features into a form the Language Decoder can accept;
  * a Text Encoder that turns the user’s text into a sequence of tokens;
  * a Language Decoder that generates the response from the visual features and the text tokens.

Training

The model was trained to predict the next token within a given text sequence.
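
In other words, the objective is standard autoregressive language modeling: at each position, the decoder's prediction is scored against the token that actually comes next. A minimal sketch of that loss with illustrative tensor shapes (the paper's exact loss masking is not shown):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on shifted tokens: position t must predict token t+1.

    logits:    (batch, seq_len, vocab_size) scores from the language decoder
    input_ids: (batch, seq_len) token ids of the target text
    """
    shift_logits = logits[:, :-1, :]   # predictions for positions 0 .. T-2
    shift_labels = input_ids[:, 1:]    # the tokens that actually follow
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```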

To fine-tune MultiModal-GPT, the researchers froze the entire OpenFlamingo model and inserted LoRA into the following parts of the Language Decoder: the self-attention layers, the cross-attention layers, and the Feed-Forward Network (FFN).
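
A hedged sketch of that setup using the Hugging Face peft library is shown below. The toy decoder block, the module names in `target_modules`, and the rank/alpha values are assumptions for illustration; OpenFlamingo's decoder uses its own layer names, and the authors' exact configuration may differ.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class ToyBlock(nn.Module):
    """Stand-in for one decoder block with attention and FFN projections."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)        # attention projections
        self.v_proj = nn.Linear(dim, dim)
        self.ffn_up = nn.Linear(dim, 4 * dim)    # feed-forward network
        self.ffn_down = nn.Linear(4 * dim, dim)

    def forward(self, x):
        x = x + self.v_proj(self.q_proj(x))
        return x + self.ffn_down(self.ffn_up(x))

decoder = nn.Sequential(*[ToyBlock() for _ in range(2)])
for p in decoder.parameters():
    p.requires_grad_(False)                       # freeze the entire base model

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    # Hypothetical names matching the toy block above, standing in for the
    # decoder's self-attention, cross-attention, and FFN projections.
    target_modules=["q_proj", "v_proj", "ffn_up", "ffn_down"],
)
decoder = get_peft_model(decoder, config)         # only LoRA adapters stay trainable
decoder.print_trainable_parameters()
```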

The researchers carried out joint training by combining two distinct types of data: language-only instruction-following data and vision-and-language instruction-following data.
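
One straightforward way to realize such joint training is to draw batches from the two sources at random during each training step. The sketch below is an assumption about the mechanics, not the authors' actual recipe; the sampling ratio and loader names are placeholders.

```python
import random

def mixed_batches(lang_only_loader, vision_lang_loader, p_lang=0.5):
    """Yield batches drawn from either instruction dataset until one is exhausted.

    p_lang is the (assumed) probability of taking a language-only batch.
    """
    lang_iter, vl_iter = iter(lang_only_loader), iter(vision_lang_loader)
    while True:
        try:
            if random.random() < p_lang:
                yield "language_only", next(lang_iter)
            else:
                yield "vision_language", next(vl_iter)
        except StopIteration:
            break
```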

The datasets used in the study:

Evaluation results

The researchers’ experiments demonstrated that MultiModal-GPT was able to have meaningful conversations with humans for multiple rounds and provided relevant and informative responses (see picture below).

MultiModal-GPT can give a recipe for baking lasagna and tell users where to eat it

How the model generates a response

Suppose you want to have a dialogue with MultiModal-GPT using an image of a lasagna dish (see the picture above). You send the image and ask the model “how to make this dish”. To generate a response, it follows these steps (summarized in the code sketch after the list):

  1. The model gets your inputs: the question and the image of the dish.
  2. It uses the Vision Encoder to process the image and extract the features (shape, color, texture of the ingredients).
  3. The Perceiver Resampler resamples the spatial features to match the size and shape of the input the Language Decoder expects.
  4. The model then uses the Text Encoder to encode your question into a sequence of tokens, such as “how”, “to”, “make”, “this”, and “dish”.
  5. The Language Decoder generates a text output based on the image features and the text tokens.
  6. The model sends you the text output describing the steps or ingredients for making the dish, such as: “To make this dish, you need to prepare the ingredients and follow the recipe. First, you will need to prepare the pasta sauce…”
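
Putting these steps together, the whole flow can be summarized in a short code-style sketch. The function and attribute names (`vision_encoder`, `perceiver_resampler`, `language_decoder`, and so on) are placeholders for the corresponding MultiModal-GPT components, not the project's actual API.

```python
def answer_question(image, question, model, tokenizer, max_new_tokens=128):
    """Illustrative end-to-end flow: image + question -> generated answer."""
    # Steps 1-2: encode the image into spatial visual features.
    visual_features = model.vision_encoder(image)

    # Step 3: resample the features into the fixed-size form the decoder expects.
    visual_tokens = model.perceiver_resampler(visual_features)

    # Step 4: tokenize the question ("how", "to", "make", "this", "dish", ...).
    text_tokens = tokenizer(question)

    # Step 5: generate the answer autoregressively, conditioned on both modalities.
    output_tokens = model.language_decoder.generate(
        visual_tokens, text_tokens, max_new_tokens=max_new_tokens
    )

    # Step 6: turn the generated tokens back into readable text for the user.
    return tokenizer.decode(output_tokens)
```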

If you wish to continue the conversation, you may send another text input or image to the model.

The researchers noted that the quality of the training data is an important factor in determining how well the model performs in dialogues with humans.

If the dataset used to train the model contains only limited information, the model generates only brief replies.
