Researchers at MIT have built a machine learning model that can understand the underlying relationships between objects in a scene and generate realistic images from text descriptions.

Generally, when humans see a scene, they focus on the objects and their interactions. Many deep learning models struggle to see the world in this way because they lack an understanding of the entangled relationships between individual objects. Without this knowledge, a robot built to assist someone in a kitchen would struggle to execute directions such as “pick up the spatula on the left side of the stove and place it on a cutting board.”

To address this issue, MIT researchers created a model that comprehends the underlying relationships between objects in a scene. The model represents individual relationships one at a time, then combines these representations to describe the complete scene. This composition enables the model to generate more realistic images from text descriptions, even when the scene comprises several objects in varying relationships to one another. The work applies wherever an industrial robot must conduct complex, multi-step manipulation tasks, such as stacking items in a warehouse or assembling appliances. It also moves the research a step closer to machines that can learn from and interact with their surroundings the way a human does.

According to Yilun Du, a doctoral student at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), “When we look at a table, we cannot say that an object exists at some XYZ position. That is not how our minds work. When we comprehend a scene, we comprehend it through the relationships between the objects.”

One relationship at a time

The researchers’ framework generates images of a scene from a text description of the objects and their relationships, such as “a wooden table to the left of a blue stool. A red sofa to the right of a blue stool.” Their technique first breaks such a description into shorter pieces that each capture one particular relationship (“a wooden table to the left of a blue stool” and “a red sofa to the right of a blue stool”), models each piece individually, and then combines the pieces through an optimisation process that generates an image of the scene. To represent the individual object relationships, the researchers used a machine learning technique called energy-based models, which lets them encode each relational description with its own energy-based model.
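To make the composition concrete, here is a minimal sketch, assuming a toy PyTorch setup: each relational clause is scored by its own small energy-based model, and an image is generated by running gradient-based (Langevin-style) sampling on the sum of those energies. The RelationEBM architecture, image size, and sampling hyperparameters are illustrative assumptions, not the paper’s implementation.

```python
# Minimal sketch of composing per-clause energy-based models (illustrative only).
import torch
import torch.nn as nn

class RelationEBM(nn.Module):
    """Scores how well an image matches ONE relational clause (lower = better)."""
    def __init__(self, text_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(           # toy image encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64 + text_dim, 1)

    def forward(self, image: torch.Tensor, clause_emb: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(image)
        return self.head(torch.cat([feats, clause_emb], dim=-1)).squeeze(-1)

def compose_and_sample(ebms, clause_embs, steps=60, step_size=10.0, noise=0.005):
    """Langevin-style sampling on the SUM of all per-clause energies."""
    img = torch.rand(1, 3, 64, 64, requires_grad=True)   # start from noise
    for _ in range(steps):
        energy = sum(ebm(img, emb) for ebm, emb in zip(ebms, clause_embs))
        (grad,) = torch.autograd.grad(energy.sum(), img)
        with torch.no_grad():
            img -= step_size * grad                      # descend the energy
            img += noise * torch.randn_like(img)         # Langevin noise
            img.clamp_(0, 1)
    return img.detach()

# Illustrative usage: two clauses composed into one scene.
ebms = [RelationEBM(), RelationEBM()]
clause_embs = [torch.randn(1, 64), torch.randn(1, 64)]   # stand-in text embeddings
scene = compose_and_sample(ebms, clause_embs)
```

Because generation only touches the summed energy, the per-clause models stay independent, which is what makes the composition step modular.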

“Other systems would take all the relationships holistically and generate the image from the description in a single shot. However, such approaches fail on out-of-distribution descriptions, such as descriptions with more relationships, because these models cannot adapt a single shot to create an image containing more relationships. As we compose these separate, smaller models together, we can model a larger number of relationships and adapt to novel combinations,” Du explains.
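Continuing the toy sketch above (still an assumption about the setup, not the authors’ code), the adaptability Du describes falls out of the additive composition: a model for a further clause can simply be appended at sampling time, without retraining the others.

```python
# Append a third, previously unseen clause model and sample again; the two
# existing per-clause models are reused unchanged (toy setup from the sketch above).
third_ebm, third_emb = RelationEBM(), torch.randn(1, 64)
scene_with_three = compose_and_sample(ebms + [third_ebm], clause_embs + [third_emb])
```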

The method also works in reverse: given an image, it can retrieve a text description that matches the relationships between the objects in the scene. Additionally, the models can be used to edit an image by rearranging the objects in the scene to match a new description.
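A hedged sketch of that reverse use, in the same toy setup: since each energy-based model assigns lower energy to images that better satisfy its clause, candidate descriptions can be ranked for a given image by summing their per-clause energies and picking the lowest. The one-model-per-clause pairing follows the sketch above and is an assumption, not the paper’s API.

```python
import torch

def best_description(image, candidates):
    """Rank candidate descriptions for an image.

    candidates: list of (ebms, clause_embs) pairs, one pair per description,
    where each description is already split into relational clauses.
    Returns the index of the description with the lowest summed energy,
    i.e. the one whose relationships best fit the image.
    """
    scores = []
    with torch.no_grad():
        for ebms, clause_embs in candidates:
            total = sum(ebm(image, emb) for ebm, emb in zip(ebms, clause_embs))
            scores.append(total.item())
    return min(range(len(scores)), key=scores.__getitem__)
```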

Recognising complicated scenes

The researchers compared their model to existing deep learning techniques: each was given text descriptions and tasked with generating images showing the corresponding objects and their relationships. In every case, their model outperformed the baselines. The researchers also showed the model photographs of previously unseen scenes, along with several alternative text descriptions of each image, and it correctly picked the description that best matched the object relationships in the image.

Conclusion

Furthermore, when the researchers gave the system two relational scene descriptions that depicted the same image in different ways, the model recognised that the descriptions were equivalent. While these preliminary results are encouraging, the researchers want to evaluate how their model performs on more complex real-world photos, with noisy backgrounds and objects that occlude one another. They are also interested in eventually incorporating their approach into robotics systems, which would enable a robot to infer object relationships from video and then manipulate objects in the real world.

For more information, refer to the article.

Sources of Article

https://arxiv.org/pdf/2111.09297.pdf
https://news.mit.edu/2021/ai-object-relationships-image-generation-1129
