Intelligent autonomous robots are needed in applications such as space exploration, transportation, industry, and defense. Mobile robots can also perform material handling, disaster relief, patrolling, and rescue operations. To do so, an autonomous robot must be able to travel freely through static or dynamic environments.

The main aim of mobile robot navigation is to move smoothly and safely through a cluttered environment, from a start position to a goal position, along a safe path of optimal length. To this end, researchers have explored several path-planning techniques for robot navigation.

Researchers from MIT have developed a method that uses language-based inputs, instead of costly visual data, to direct a robot through a multistep navigation task. Current approaches often rely on multiple hand-crafted machine-learning models to tackle different parts of the task, which requires considerable human effort and expertise to build. Such methods, which use visual representations to make navigation decisions directly, also demand massive amounts of visual training data, which are often hard to come by.

The navigation method, developed by researchers at MIT and the MIT-IBM Watson AI Lab, converts visual observations into pieces of language, which are then fed into one large language model that handles all parts of the multistep navigation task. The method creates text captions that describe the robot's point of view; a large language model uses these captions to predict the actions the robot should take to fulfill a user's language-based instructions.

According to Bowen Pan, an electrical engineering and computer science (EECS) graduate student and lead author of the paper, purely using language as the perceptual representation makes for a more straightforward approach: because all the inputs can be encoded as language, the system can generate a human-understandable trajectory.

Finding solutions with language

Pan explains that, because large language models are the most powerful machine-learning models available, the researchers sought to incorporate them into the complex task known as vision-and-language navigation. However, such models take only text-based inputs and cannot process visual data from a robot's camera, so the team needed to find a way to use language instead.

The technique proposed by the team uses a simple captioning model to obtain text descriptions of the robot's visual observations. These captions are combined with the user's language-based instructions and fed into a large language model, which decides what navigation step the robot should take next. The large language model also outputs a caption of the scene the robot should see after completing that step.

The model repeats this process to generate a trajectory that guides the robot to its goal, one step at a time.
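To make the loop concrete, here is a minimal sketch in Python. It is illustrative only, not the team's released code: caption_image() and llm() are hypothetical stand-ins for a real image-captioning model and a real large language model, stubbed here so the control flow actually runs.

def caption_image(frame) -> str:
    # Hypothetical captioner: a real model would describe the camera frame.
    return "a hallway with an open door on the left"

def llm(prompt: str) -> str:
    # Hypothetical LLM: a real model would reason over the prompt.
    return "stop"

def navigate(instruction: str, get_frame, execute, max_steps: int = 20) -> list[str]:
    """Step the robot toward its goal one action at a time, keeping the
    entire trajectory as human-readable text."""
    trajectory: list[str] = []
    for _ in range(max_steps):
        caption = caption_image(get_frame())  # vision -> language
        prompt = (
            f"Instruction: {instruction}\n"
            f"Steps so far: {'; '.join(trajectory) or 'none'}\n"
            f"Current view: {caption}\n"
            "Next action (move forward / turn left / turn right / stop):"
        )
        action = llm(prompt)  # language -> navigation decision
        trajectory.append(f"{caption} -> {action}")
        if action == "stop":
            break
        # As described above, the LLM can also be asked for a caption of the
        # scene the robot should see after the step, which can be compared
        # against the caption actually observed on the next iteration.
        execute(action)
    return trajectory

# Example run with the stubs; a real system would pass live camera frames
# and the robot's motion controller instead.
print(navigate("Go to the kitchen.", get_frame=lambda: None, execute=print))

Because every intermediate state is a sentence rather than an image embedding, the trajectory the loop returns can be read and audited directly by a human.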

Benefits of language

While the approach does not outperform vision-based techniques, the researchers found that it offers several advantages. First, because text requires fewer computational resources to synthesize than complex image data, the method can be used to rapidly generate synthetic training data. The technique can also help bridge the gap that can prevent an agent trained in a simulated environment from performing well in the real world.

Furthermore, the representations their model uses are easier for a human to understand because they are written in natural language. In addition, their method could be applied more easily to varied tasks and environments because it uses only one type of input.
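As a rough illustration of the synthetic-data point, the sketch below generates training examples purely as text, so no simulator or image renderer is involved. It is an assumption-laden toy, not the authors' pipeline: llm() is again a hypothetical stand-in for a real large language model, and the instructions are invented for the example.

def llm(prompt: str) -> str:
    # Stub: a real LLM would invent a plausible trajectory for the prompt.
    return ("You see a long hallway. move forward. "
            "You see a door on the right. stop.")

def synthesize_example(instruction: str) -> dict:
    """Ask the LLM to write a caption/action trajectory for an instruction,
    yielding an (instruction, trajectory) pair of pure text for training."""
    trajectory = llm(
        f"Instruction: {instruction}\n"
        "Write the sequence of scene captions and actions a robot would "
        "go through to fulfill this instruction:"
    )
    return {"instruction": instruction, "trajectory": trajectory}

dataset = [
    synthesize_example("Go down the hallway and stop at the second door."),
    synthesize_example("Walk past the table and wait by the window."),
]
print(len(dataset), "synthetic text-only training examples")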
