Google releases PaLM-E, the largest "super brain" to date, and robots become more versatile

What happens when ChatGPT has vision?

Editor’s note: ChatGPT has taken most of the limelight in AI lately. But PaLM-E, an AI model recently unveiled by Google that can see and can guide robots without task-specific training, has also shown impressive capabilities. The emergent abilities of this model, the largest visual language model so far, have people thinking about general artificial intelligence. This article is a compiled translation.

On Monday, a team of artificial intelligence researchers from Google and the Technical University of Berlin unveiled a multimodal visual language model (VLM) called PaLM-E. The model has 562 billion parameters and integrates vision and language for robot control. The researchers claim that it is the largest VLM developed to date and that it can perform a variety of tasks without retraining.

According to Google, when given a high-level command such as "bring me the rice chips from the drawer", PaLM-E can generate an action plan for a mobile robot platform with an arm (developed by Google Robotics) and then execute that plan by itself.

PaLM-E does this by analyzing data from the robot's camera, without needing a preprocessed representation of the scene. This removes the need for humans to preprocess or annotate the data and allows more autonomous robot control.

In a demo video provided by Google, PaLM-E carries out the instruction "bring me the rice chips from the drawer", a task that involves several planning steps as well as visual feedback from the robot's camera.

The model is also resilient and can react to its environment. For example, PaLM-E can guide the robot to the kitchen to fetch a bag of rice chips. Because PaLM-E is integrated into the control loop, it tolerates interruptions that occur during the task. In one video example, the researchers repeatedly put back the bag that the robot had picked up, but each time the robot found the bag and picked it up again.

In another example, the same PaLM-E model is shown to control the robot autonomously through tasks with complex sequences. Previously, such tasks often required manual guidance. Google’s research paper explains how PaLM-E transforms instructions into actions:

We demonstrate the performance of PaLM-E on challenging and diverse mobile manipulation tasks. We largely follow the setup of Ahn et al. (2022), in which the robot needs to plan a sequence of navigation and manipulation actions based on a human instruction. For example, given the instruction "I spilled my drink, can you bring me something to clean up?", the robot needs to plan an action sequence containing "1. Find the sponge, 2. Pick up the sponge, 3. Bring it to the user, 4. Put down the sponge". Inspired by these tasks, we developed three use cases to test PaLM-E’s embodied reasoning abilities: affordance prediction, failure detection, and long-horizon planning. The low-level policies come from RT-1 (Brohan et al., 2022), a transformer model that takes RGB images and natural language instructions and outputs end-effector control commands.
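The division of labor in the passage above (a high-level planner that emits text steps, and a low-level policy that carries out each step) can be sketched roughly as follows. This is only an illustrative toy, not the method from the paper: `plan_steps`, `low_level_policy`, and `run_instruction` are hypothetical names that stand in for the roles played by PaLM-E and RT-1, and the retry loop is an assumed, simplified form of failure detection.

```python
from typing import Callable, List

# Hypothetical stand-in for the high-level planner (PaLM-E's role).
# A real system would query the model with the instruction and camera
# images; here a fixed lookup returns the plan quoted in the excerpt.
def plan_steps(instruction: str) -> List[str]:
    if "clean up" in instruction:
        return ["find the sponge", "pick up the sponge",
                "bring it to the user", "put down the sponge"]
    return []

# Hypothetical stand-in for the low-level policy (RT-1's role): it takes
# one text step and reports success. A real policy would also consume
# RGB images and output end-effector commands.
def low_level_policy(step: str) -> bool:
    return True

def run_instruction(instruction: str,
                    policy: Callable[[str], bool] = low_level_policy,
                    max_retries: int = 3) -> List[str]:
    """Execute each planned step, retrying a step when it fails."""
    log: List[str] = []
    for step in plan_steps(instruction):
        for attempt in range(1, max_retries + 1):
            succeeded = policy(step)
            status = "ok" if succeeded else "failed"
            log.append(f"{step}: {status} (attempt {attempt})")
            if succeeded:
                break
        else:
            # Every retry failed, so stop executing the plan.
            log.append(f"{step}: giving up")
            return log
    return log
```

Calling `run_instruction("I spilled my drink, can you bring me something to clean up?")` returns one log line per step. Passing a `policy` that fails occasionally shows the retry behavior, loosely similar to the robot picking up the bag again after an interruption.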

PaLM-E is a next-token predictor. It is called "PaLM-E" because it is based on Google’s existing large language model, "PaLM" (similar to the technology behind ChatGPT). By adding sensory information and robot control, Google made PaLM "embodied".

Because it is based on a language model, PaLM-E takes continuous observations, such as images or sensor data, and encodes them into a sequence of vectors of the same size as language tokens. In this way, the model can "understand" sensory information in the same way it processes language.
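The idea of turning observations into vectors the same size as token embeddings can be illustrated with a small sketch. Everything here is assumed for illustration: `EMBED_DIM`, the tiny `TOKEN_EMBEDDINGS` table, and the hand-written projection matrix are hypothetical, and in a real model such a mapping would be learned rather than written by hand.

```python
from typing import Dict, List, Sequence, Union

EMBED_DIM = 4  # assumed toy size; real language models use far larger vectors

Vector = List[float]

# Hypothetical token embedding table for a tiny vocabulary.
TOKEN_EMBEDDINGS: Dict[str, Vector] = {
    "pick": [1.0, 0.0, 0.0, 0.0],
    "up": [0.0, 1.0, 0.0, 0.0],
    "the": [0.0, 0.0, 1.0, 0.0],
}

def project(features: Sequence[float],
            weights: Sequence[Sequence[float]]) -> Vector:
    """Linearly map a sensor feature vector to one vector per weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def build_input(sequence: Sequence[Union[str, Sequence[float]]],
                weights: Sequence[Sequence[float]]) -> List[Vector]:
    """Interleave word embeddings and projected observations in one sequence."""
    out: List[Vector] = []
    for item in sequence:
        if isinstance(item, str):
            out.append(TOKEN_EMBEDDINGS[item])    # ordinary language token
        else:
            out.append(project(item, weights))    # observation, token-sized
    return out

# A 2-value "observation" (say, pooled image features) and a 4x2 projection,
# so the projected observation has EMBED_DIM entries.
observation = [0.5, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

model_input = build_input(["pick", "up", "the", observation], weights)
```

After this runs, every entry of `model_input` has `EMBED_DIM` values, so a downstream sequence model could treat the projected observation like any other token in the sentence.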

Google also provided a demonstration video showing a robot, guided by PaLM-E, carrying out the instruction "bring me a green star". According to the researchers, the green star "is an object that this robot was not directly exposed to before."

In addition to the RT-1 robotics transformer, PaLM-E also draws on Google’s previous work on ViT-22B, a vision transformer model released in February this year. ViT-22B has been trained on various visual tasks, such as image classification, object detection, semantic segmentation, and image captioning.

Google Robotics is not the only research group working on robot control with neural networks. This work is reminiscent of Microsoft's recent "ChatGPT for Robotics" paper, which explores combining visual data with large language models for robot control in a similar way.

Robots aside, the Google researchers observed several interesting effects that apparently stem from using a large language model as the core of PaLM-E. First, the model exhibits "positive transfer", meaning it can transfer the knowledge and skills learned on one task to another. As a result, it performs significantly better than single-task robot models.

They also observed a trend related to model scale: "The larger the language model, the more it maintains its language capabilities when training on visual-language and robotics tasks. Quantitatively, the 562B-parameter PaLM-E model nearly retains all of its language capabilities."

PaLM-E is the largest VLM reported so far. Although it was trained only on single-image prompts, we have observed emergent capabilities such as multimodal chain-of-thought reasoning and multi-image inference. Although this is not the focus of our work, PaLM-E has set a new SOTA (state of the art) on the OK-VQA benchmark.

——Danny Driess

The researchers claim that PaLM-E exhibits emergent abilities, such as multimodal chain-of-thought reasoning (which lets the model analyze a sequence of inputs that includes both language and visual information) and multi-image inference (which uses multiple images as input to make an inference or prediction), even though it was trained only on single-image prompts. In that sense, PaLM-E seems to continue the trend of surprises emerging as deep learning models grow more complex.

The Google researchers also plan to explore more applications of PaLM-E in real-world scenarios, such as home automation or industrial robotics. They hope that PaLM-E will inspire more research on multimodal reasoning and embodied AI.

"Multimodal" is a buzzword right now, and we will hear it more and more, because major companies are pursuing general artificial intelligence that can perform general tasks the way humans do.

Translator: boxi.