Google DeepMind’s First “Thinking” Robot AI Is Here

Summary
– Google DeepMind’s Gemini Robotics project uses generative AI to create robots that can “think” before acting, similar to how AI models generate text or images.
– The project introduces two new models: Gemini Robotics 1.5, which generates robot actions from visual and text data, and Gemini Robotics-ER 1.5, which plans the steps for complex tasks.
– Generative AI is considered crucial for robotics because it enables general functionality, allowing robots to handle new situations without needing reprogramming for each specific task.
– Current robots are highly specialized and difficult to deploy, often requiring months of training to perform a single task, a limitation this new approach aims to overcome.
– The embodied reasoning (ER) model achieves high benchmark scores by making accurate decisions about interacting with physical spaces, but it relies on the action model to execute the tasks.
The emergence of generative AI systems capable of producing text, images, and audio has paved the way for a new frontier: directing the actions of physical robots. This principle underpins Google DeepMind’s Gemini Robotics initiative, which recently unveiled two collaborative models designed to give robots a form of simulated reasoning before they move. While large language models have their limitations, adding a reasoning step has previously improved their performance, and a similar gain now appears within reach for robotics.
DeepMind’s researchers argue that generative AI holds transformative potential for robotics by enabling general functionality. Present-day robots are typically specialists, requiring extensive, focused training for a single job and struggling to adapt to new scenarios. Carolina Parada, Google DeepMind’s head of robotics, highlighted this challenge, noting that deploying a robot for one specific task often takes many months of complex installation. Generative systems, by contrast, are built on a more flexible foundation, allowing AI-powered robots to interpret and operate in entirely unfamiliar environments without needing reprogramming. The current DeepMind strategy employs a dual-model architecture: one for planning and another for execution.
This approach is realized through two new models: Gemini Robotics 1.5 and Gemini Robotics-ER 1.5. The first is a vision-language-action (VLA) model; it processes visual and textual information to directly command a robot’s movements. The second model, distinguished by the “ER” suffix for embodied reasoning, functions as a vision-language model (VLM). It accepts visual and text inputs to generate a logical sequence of steps required to accomplish a complicated task.
The model responsible for the “thinking” is Gemini Robotics-ER 1.5. It represents a significant step as the first robotics AI to perform simulated reasoning, a capability similar to that found in modern text-based chatbots. While Google refers to this process as “thinking,” it’s important to recognize this as a descriptive term within the context of generative AI rather than a claim of consciousness. DeepMind reports that the ER model has achieved leading scores on both academic and internal tests, demonstrating its ability to make precise judgments about navigating and manipulating a physical space. Crucially, this model only plans the actions; it does not carry them out. That responsibility falls to its partner, the Gemini Robotics 1.5 model.
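To make that division of labor concrete, here is a minimal, purely illustrative Python sketch of a planner/executor loop. It is not DeepMind’s API: the class names (EmbodiedReasoner, ActionModel) and the plan/execute methods are hypothetical stand-ins for the two roles the article describes, where one model breaks a task into steps and the other turns each step into robot commands.

```python
# Hypothetical sketch of the two-model split described above.
# None of these names come from Google DeepMind's actual tooling; they only
# illustrate the "plan first, then act" pattern.

from dataclasses import dataclass


@dataclass
class Step:
    description: str  # natural-language instruction for a single sub-task


class EmbodiedReasoner:
    """Stands in for the planning model (the 'ER' role): it looks at the
    task and the scene and produces an ordered list of steps."""

    def plan(self, task: str, scene: str) -> list[Step]:
        # A real model would reason over images and text; here we fake it.
        return [
            Step(f"locate objects relevant to: {task}"),
            Step(f"grasp the target object seen in: {scene}"),
            Step("place the object at the goal location"),
        ]


class ActionModel:
    """Stands in for the vision-language-action model: it turns one
    natural-language step into low-level robot commands."""

    def execute(self, step: Step) -> bool:
        print(f"executing -> {step.description}")
        return True  # pretend the motion succeeded


def run_task(task: str, scene: str) -> None:
    planner, actor = EmbodiedReasoner(), ActionModel()
    for step in planner.plan(task, scene):  # planner decides *what* to do
        if not actor.execute(step):         # actor decides *how* to do it
            print("step failed; replanning would happen here")
            break


if __name__ == "__main__":
    run_task("sort the laundry by color", "a table with mixed clothes")
```

The sketch mirrors the article’s description of the architecture: the planner never moves the robot itself, and the action model never has to reason about the overall goal.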
(Source: Ars Technica)
