
Google is using Gemini AI to train its robots to navigate their surroundings and complete tasks more capably. In a recent research paper, the DeepMind robotics team described how Gemini 1.5 Pro's long context window, which determines how much information an AI model can take in at once, lets users communicate with its RT-2 robots more easily through natural language instructions.

"To achieve this, we study a widely useful category of navigation tasks we call Multimodal Instruction Navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recorded demonstration video," Google said, adding that advances in Vision Language Models have shown a promising path toward achieving this goal.

The robot navigates its surroundings and interprets commands using the latest version of Google's Gemini large language model. For example, when a person tells the robot, "Find me somewhere to write," it responds by leading them to a clean whiteboard elsewhere in the building.

Gemini's ability to handle video as well as text, and to ingest large amounts of information in the form of previously recorded video tours of the office, allows the "Google helper" robot to make sense of its environment and navigate correctly when given commands that require some commonsense reasoning. To turn the model's output into precise actions, such as turning in response to a command or to what it sees in front of it, the robot combines Gemini with a separate navigation algorithm.
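Read that way, the system splits into two stages: a long-context vision language model decides which point in the recorded tour best answers the user's request, and a conventional planner converts that goal into motion. The sketch below is a minimal, hypothetical illustration of that split only; the function names, the vlm.generate call, and the robot and topological_map interfaces are assumptions made for illustration and are not DeepMind's actual code.

```python
# Illustrative sketch only: all names and interfaces here are hypothetical,
# not Google DeepMind's implementation.

def find_goal_frame(tour_frames, instruction, query_image, vlm):
    """Stage 1: ask a long-context vision language model which frame of the
    recorded demonstration tour best satisfies the user's request."""
    prompt = (
        "Given these frames from a tour of the building and the user's "
        "request, reply with the index of the frame that best fulfils it.\n"
        f"Request: {instruction}"
    )
    reply = vlm.generate(prompt, images=[query_image, *tour_frames])
    return int(reply.strip())  # index of the chosen goal frame


def navigate_to(goal_frame_index, topological_map, robot):
    """Stage 2: a conventional planner, not the language model, turns the
    chosen goal into low-level motion along waypoints from the tour."""
    path = topological_map.shortest_path(robot.current_node, goal_frame_index)
    for waypoint in path:
        robot.move_to(waypoint)


def handle_request(instruction, robot, tour_frames, topological_map, vlm):
    """End to end: 'Find me somewhere to write' -> goal frame -> motion."""
    query_image = robot.camera.capture()
    goal = find_goal_frame(tour_frames, instruction, query_image, vlm)
    navigate_to(goal, topological_map, robot)
```

The point of the split is that the language model only has to reason about where to go, while the lower-level algorithm handles how to get there.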

Until a few years ago, a robot needed a map of its surroundings and carefully chosen commands to navigate reliably. Large language models encode useful knowledge about the real world, and their more recent variants, known as vision language models, are trained not only on text but also on images and video, so they can answer questions that involve perception. Gemini lets Google's robot interpret both spoken and visual instructions, for example following a whiteboard sketch that marks out a route to a new destination.

The researchers say they plan to test the technique on other types of robots. They add that Gemini should also be able to handle more complex queries, such as "Do they have my favourite drink today?" asked by a user with a desk full of empty Coke cans.


Source: Google DeepMind
