Google PaLM-E is a powerful new AI model for advanced robots

A group of researchers from Google Robotics, TU Berlin, and Google Research created PaLM-E, a technology that enables robots to understand language, analyze images, and use both of these abilities to carry out complex tasks.

The model

The PaLM-E architecture combines the powerful large language model PaLM, "embodied" (hence the "-E"), with sensor data from a robotic agent.

Although large language models (LLMs) such as PaLM, GPT-3, and GPT-4 have proven highly skilled at complex tasks like language translation, question answering, and generating natural-sounding text, they face a significant challenge when it comes to broad inference in the real world, especially in robotics.

To overcome this difficulty, the authors propose the use of embodied language models that integrate continuous sensor data, such as images, from the robotic agent directly into the language model.

PaLM-E model architecture

PaLM-E combines the power of PaLM with one of Google's most sophisticated vision models, ViT-22B. The combination produced what the authors believe to be the largest vision-language model reported in the research literature to date.

In essence, PaLM-E receives an input of multimodal "sentences", which can interleave text, images, robot states, and scene embeddings, and generates output text in an auto-regressive manner.
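The idea behind these multimodal "sentences" can be illustrated with a toy sketch: image features are projected into the same embedding space as word tokens and slotted into the token sequence at the position where the image is mentioned. Everything below (names, dimensions, the random "encoder") is illustrative, not the actual PaLM-E or ViT code.

```python
import numpy as np

# Toy sketch of assembling a PaLM-E-style multimodal "sentence".
# All names and dimensions are illustrative, not the real API.
D = 8  # shared embedding dimension (toy size)

rng = np.random.default_rng(0)
word_table = {w: rng.normal(size=D)
              for w in ["What", "is", "in", "the", "image", "?"]}

def embed_text(tokens):
    """Look up each word's embedding vector."""
    return [word_table[t] for t in tokens]

def encode_image(image):
    """Stand-in for a ViT encoder: project an image into a few
    vectors living in the same D-dim space as word embeddings."""
    patches = image.reshape(4, -1)            # 4 toy "patches"
    proj = rng.normal(size=(patches.shape[1], D))
    return list(patches @ proj)               # 4 vectors of size D

# A multimodal sentence: text tokens with an image slotted in between.
prefix = embed_text(["What", "is", "in", "the"])
img_vecs = encode_image(rng.normal(size=(4, 4)))
suffix = embed_text(["image", "?"])

# One interleaved sequence, ready for an auto-regressive decoder.
sequence = np.stack(prefix + img_vecs + suffix)
print(sequence.shape)  # (4 text + 4 image + 2 text) vectors x D
```

In the real model, the decoder then attends over this mixed sequence exactly as it would over plain text, which is what lets one language model condition on images and robot state.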

The model’s performance

PaLM-E was evaluated on three robotic environments. The results show that it is capable of handling a range of embodied reasoning tasks, by using different observation modalities and embodiments.

PaLM-E-562B, in particular, has achieved the highest score ever reported on the OK-VQA dataset, which is a particularly challenging benchmark that requires both visual comprehension and external knowledge of the world.


PaLM-E represents a significant advancement in multimodal learning, as it allows generally capable models to be trained to address vision, language, and robotics simultaneously.

Beyond robotics, PaLM-E has the potential to enable other applications in multimodal learning by unifying tasks that were previously considered separate.
