Researchers developed the HiP framework, which generates detailed plans for robots by leveraging the knowledge of three distinct foundation models, enabling robots to perform multi-step construction, manufacturing, and household tasks.

Your daily agenda is relatively uncomplicated: wash the dishes, buy groceries, and attend to other small matters. You probably did not explicitly write down instructions like "pick up the first dirty dish" or "wash that plate with a sponge," because each of these steps feels instinctive. While a human can perform each task with little conscious effort, a robot needs a sophisticated plan built from far more intricate blueprints.

MIT's Improbable AI Lab, a division of the Computer Science and Artificial Intelligence Laboratory (CSAIL), has introduced a multimodal framework called Compositional Foundation Models for Hierarchical Planning (HiP). The framework draws on the knowledge of three distinct foundation models to generate detailed, achievable plans. Like OpenAI's GPT-4, the foundation model underlying ChatGPT and Bing Chat, these models are trained on vast amounts of data and power applications such as generating images, translating text, and controlling robots.

HiP differs from RT-2 and other multimodal models in that it uses three separate foundation models, each trained on a different data modality: vision, language, and action. Each foundation model captures a different part of the decision-making process, and the three work together when decisions need to be made. This design removes the need for paired vision, language, and action data, which is difficult to acquire, and it also makes the reasoning process more transparent.

A task that a person would consider routine can be a "long-horizon goal" for a robot: an overarching objective that requires completing many smaller steps first. Computer vision researchers have tried to build monolithic foundation models for this problem, but pairing language, visual, and action data is expensive. HiP instead represents a different, multimodal recipe, one that cheaply incorporates linguistic, physical, and environmental intelligence into a robot.

Evaluation

The CSAIL team evaluated HiP's accuracy on three manipulation tasks, where it outperformed comparable frameworks. The system reasoned its way to intelligent plans that adapted to new information.

First, the researchers asked the robot to stack blocks of different colours and then place other blocks nearby. There was a catch, though: some of the required colours were missing, so the robot had to place white blocks in a paint bowl to colour them. HiP adapted to these changes far more gracefully than state-of-the-art task-planning systems such as Transformer BC and Action Diffuser, dynamically adjusting its plans to stack and place each block as required.

Planning method

HiP's planning method operates as a hierarchy with a three-pronged approach, and each component can be pre-trained on a different dataset, including data from domains outside robotics. The process begins with a large language model (LLM), which starts ideating by capturing the symbolic information needed and formulating an abstract, high-level plan for the task. Drawing on the general knowledge it absorbs from the internet, the model breaks its main objective into smaller, more specific subgoals; a code sketch of this decomposition appears after the list below. For instance, "making a cup of tea" involves:

  • Filling a pot with water.
  • Boiling the water.
  • Carrying out the necessary subsequent steps.
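
To make the idea concrete, here is a minimal sketch of how such an LLM stage might decompose a long-horizon goal into subgoals. The llm_complete helper is a hypothetical stand-in for a real language model call, stubbed with a canned answer so the example runs on its own; it is not HiP's actual interface.

```python
# A minimal sketch of the LLM stage, assuming a hypothetical
# llm_complete() helper that stands in for any language model API.
# It is stubbed with a canned answer so the example runs on its own;
# this is not HiP's actual interface.

def llm_complete(prompt: str) -> str:
    # Hypothetical LLM call; a real system would query a model here.
    return ("1. Fill the pot with water\n"
            "2. Boil the water\n"
            "3. Steep the tea\n"
            "4. Pour the tea into a cup")

def decompose_goal(goal: str) -> list[str]:
    """Ask the LLM to break a long-horizon goal into ordered subgoals."""
    prompt = f"Break the task '{goal}' into short, numbered subgoals:"
    response = llm_complete(prompt)
    # The stub returns one "N. subgoal" entry per line; strip the numbers.
    return [line.split(". ", 1)[1] for line in response.splitlines()]

for i, subgoal in enumerate(decompose_goal("make a cup of tea"), start=1):
    print(i, subgoal)
```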

Visual sensors

These models also need some form of perception, such as visual sensors, to understand the surrounding world and correctly execute each subgoal. The researchers used a large video diffusion model to augment the initial planning produced by the LLM; the video model collects geometric and physical information about the world from footage on the internet. In turn, it generates an observation trajectory plan, refining the LLM's outline to incorporate this newly acquired physical knowledge.
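
The sketch below captures the shape of this stage under stated assumptions: a made-up video_diffusion_plan function that, conditioned on the current observation and a subgoal, returns a short trajectory of predicted frames. The denoising loop is a toy stand-in for a trained video diffusion model, not HiP's actual implementation.

```python
import numpy as np

# Hypothetical video-stage sketch: conditioned on the current camera
# observation and a subgoal, return a short trajectory of predicted
# frames. The "denoising" loop below just blends random noise toward
# the conditioning image; a real system would run a trained,
# subgoal-conditioned video diffusion model instead.

def video_diffusion_plan(observation, subgoal, horizon=8, denoise_steps=4):
    # Start from pure noise, one noisy frame per future time step.
    frames = [np.random.randn(*observation.shape) for _ in range(horizon)]
    for _ in range(denoise_steps):
        # Each step pulls the frames toward the conditioning observation
        # (the real model would also be guided by the subgoal text).
        frames = [0.5 * frame + 0.5 * observation for frame in frames]
    return frames  # the observation trajectory plan for this subgoal

observation = np.zeros((64, 64, 3))
plan = video_diffusion_plan(observation, "fill the pot with water")
print(len(plan), plan[0].shape)  # 8 predicted frames of shape (64, 64, 3)
```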

Iterative refinement

The iterative refinement process lets HiP reason about its ideas, taking in feedback at each stage to produce a more practical plan. Think of an author submitting a draft to an editor: after the suggested revisions are incorporated, the editor reviews the draft once more for any remaining changes and finalizes it.
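
One simple way to picture this feedback loop is a propose-and-score routine, sketched below with hypothetical propose and score functions: one model samples candidate plans, the model below it rates their plausibility, and the best-rated candidate survives each round.

```python
import random

# Propose-and-score sketch of the feedback loop. propose() and score()
# are hypothetical stand-ins: in HiP, one model (say, the LLM) samples
# candidate plans, and the model below it (say, the video model) rates
# how physically plausible each candidate is. The best-rated candidate
# survives each round, like a draft improving under an editor's notes.

def propose(goal, n_candidates=4):
    # Hypothetical generator of candidate plans for a goal.
    return [f"{goal} (candidate {i})" for i in range(n_candidates)]

def score(candidate):
    # Hypothetical critic; a real system would score physical plausibility.
    return random.random()

def refine(goal, rounds=3):
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        for candidate in propose(goal):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
    return best

print(refine("stack the red block on the blue block"))
```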

Conclusion

The final component is an egocentric action model: it takes in a sequence of first-person views and infers the appropriate actions given the surrounding environment. At this stage, the video model's observation plan is mapped onto the space the robot can see, helping the machine decide how to execute each step of its long-horizon goal. If a robot uses HiP to prepare tea, this means it has accurately surveyed and identified the locations of the teapot, sink, and other key visual components, and it will proceed to carry out each sub-task accordingly.
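
As a rough illustration of this last step, the sketch below assumes a hypothetical infer_action inverse-dynamics function that reads consecutive predicted frames and guesses the command that moves the robot between them; the pixel-difference heuristic is purely a placeholder, not the paper's actual model.

```python
import numpy as np

# Sketch of the final, action-model stage, assuming a hypothetical
# infer_action() inverse-dynamics function: it reads two consecutive
# predicted frames and guesses the low-level command that moves the
# robot between them.

def infer_action(frame_t, frame_t_plus_1):
    # Placeholder inverse dynamics: summarize the pixel change between
    # two first-person views as a single scalar "command".
    delta = float(np.mean(frame_t_plus_1 - frame_t))
    return {"gripper_delta": delta}

def actions_from_plan(frames):
    """Turn a predicted observation trajectory into a list of actions."""
    return [infer_action(a, b) for a, b in zip(frames, frames[1:])]

frames = [np.full((64, 64, 3), float(i)) for i in range(4)]
print(actions_from_plan(frames))  # three inferred actions for four frames
```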

However, the multimodal work is currently limited by a lack of high-quality video foundation models. Once available, such models could interface with HiP's small-scale video models to further improve visual sequence prediction and robot action generation, and a higher-quality version would also reduce the video models' current data requirements.
