In May 2022, MIT researchers created a machine learning system that learns to represent data in a way that captures concepts shared between the visual and audio modalities. For example, their model can identify and label the point in a video where a specific action occurs.

People see, hear, and understand language, among other things, to gather information about the world around them. Machines, on the other hand, work out what is going on from data. So when a machine "sees" a photo, it has to turn that photo into data it can use to perform a task such as image classification. This process becomes harder when the inputs come in different formats, such as videos, audio clips, and images.

What did they do?

The researchers developed a method that uses AI to learn how to represent data in a way that captures concepts shared between sight and sound. For example, their approach can determine that a baby crying in a video clip is related to the spoken word "crying" in an audio clip. Using this information, their machine-learning model can determine where a specific action is happening in a video and label it.
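As a rough illustration of the localization idea only (not the authors' actual implementation), one could score each short segment of a video against the embedding of a query word and pick the highest-scoring segment. All function names, dimensions, and data below are placeholders invented for the example.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def locate_action(segment_embeddings, query_embedding):
    """Return the index of the video segment whose embedding best matches the query.

    segment_embeddings: (num_segments, dim) per-segment video embeddings.
    query_embedding: (dim,) embedding of the query word or audio clip (e.g. "crying").
    """
    scores = [cosine(seg, query_embedding) for seg in segment_embeddings]
    return int(np.argmax(scores)), scores

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
segments = rng.normal(size=(10, 64))              # 10 segments, 64-dim embeddings
query = segments[3] + 0.1 * rng.normal(size=64)   # query close to segment 3
best, _ = locate_action(segments, query)
print("best-matching segment:", best)
```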

Their model also makes it easier for users to understand why the machine thinks the video it found matches their search. Researchers could use this method in the future to help robots learn about the world through their senses, more like how people do.

Technical description

The MIT researchers published a paper titled "Cross-Modal Discrete Representation Learning". In it, they propose a framework for cross-modal representation learning built around a discrete embedding space that is shared across different modalities and that makes the resulting models easier to interpret. They also propose a Cross-Modal Code Matching objective that encourages models to represent cross-modal semantic concepts in the same region of the embedding space. Adding their discrete embedding space and objective to existing cross-modal representation learning models improves retrieval performance on video-text, video-audio, and image-audio datasets. The researchers also inspect the shared embedding space and find that individual codewords tend to be activated by video and audio inputs that are semantically related.
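To give a feel for the Cross-Modal Code Matching idea, here is a minimal sketch, assuming each modality's embedding is turned into a softmax distribution over a shared codebook and the two distributions of a paired video and audio clip are pushed toward each other. This is a generic illustration; the exact formulation in the paper may differ, and all names and dimensions are placeholders.

```python
import numpy as np

def code_distribution(embedding, codebook, temperature=0.1):
    """Softmax distribution over codewords based on (negative) distance to each code."""
    dists = np.linalg.norm(codebook - embedding, axis=1)   # distance to every codeword
    logits = -dists / temperature
    logits -= logits.max()                                  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def code_matching_loss(video_emb, audio_emb, codebook):
    """Cross-entropy between the code distributions of a paired video and audio clip.

    A low value means the two modalities "agree" on which codewords describe the pair.
    """
    p_video = code_distribution(video_emb, codebook)
    p_audio = code_distribution(audio_emb, codebook)
    return float(-np.sum(p_video * np.log(p_audio + 1e-8)))

# Toy usage: a shared codebook of 1,000 codewords in a 64-dim space.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(1000, 64))
video_emb = codebook[42] + 0.05 * rng.normal(size=64)   # both inputs land near codeword 42
audio_emb = codebook[42] + 0.05 * rng.normal(size=64)
print("paired loss:", code_matching_loss(video_emb, audio_emb, codebook))
```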

What is representation learning?

The researchers primarily work on representation learning, a branch of machine learning that tries to transform input data so that tasks such as classification or prediction become easier.

The representation learning model takes raw data, such as videos and their text captions, and encodes it by extracting features, or observations about the objects and actions in the video. It then maps these data points onto a grid known as an embedding space. Next, the model clusters similar data points together around single points on the grid. Finally, each of these points, or vectors, is represented by a single word. A video clip of a person juggling, for example, could be mapped to a vector labelled "juggling."

Image source: MIT News Office 
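To make the mapping concrete, here is a minimal sketch of the quantization step described above: a raw feature vector is encoded and then snapped to its nearest entry in a small labelled codebook. The encoder, codebook, and labels below are placeholders invented for the example, not the researchers' actual model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder "codebook": each row is a codeword; each codeword has a human-readable label.
codebook = rng.normal(size=(5, 16))
labels = ["juggling", "crying", "barking", "typing", "singing"]

def encode(raw_features, projection):
    """Stand-in encoder: a single linear projection of the raw input features."""
    return raw_features @ projection

def quantize(embedding, codebook):
    """Return the index of the nearest codeword (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(codebook - embedding, axis=1)))

projection = rng.normal(size=(128, 16))   # maps 128-dim raw features to 16-dim embeddings
raw_clip = rng.normal(size=128)           # raw features for one video clip
idx = quantize(encode(raw_clip, projection), codebook)
print("clip mapped to codeword:", labels[idx])
```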

The researchers limit the model to a vocabulary of 1,000 words for labelling vectors. The model can decide which actions or concepts to encode into a single vector, but it can use only 1,000 vectors, so it must pick the words it thinks best represent the data. Instead of mapping data from different modalities onto separate grids, the method uses a single shared embedding space in which data from both modalities is placed together. This lets the model learn the relationship between two kinds of representation, such as a video of someone juggling and an audio recording of someone saying "juggling." They also designed an algorithm that guides the machine to encode similar concepts into the same vector, so the system can relate data from different sources.
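One simple way to picture the shared embedding space is two separate encoders, one per modality, projecting into the same vector space, with a contrastive-style loss pulling paired video and audio embeddings together and pushing mismatched pairs apart. The sketch below illustrates that generic idea under those assumptions; it is not the algorithm from the paper, and all names and sizes are made up.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def contrastive_alignment_loss(video_embs, audio_embs, temperature=0.07):
    """InfoNCE-style loss: the i-th video should match the i-th audio clip.

    video_embs, audio_embs: (batch, dim) embeddings produced by two different
    encoders but living in the same shared space.
    """
    v = l2_normalize(video_embs)
    a = l2_normalize(audio_embs)
    logits = v @ a.T / temperature                 # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # correct pairs sit on the diagonal

# Toy usage: 8 paired clips in a 64-dim shared space.
rng = np.random.default_rng(3)
shared = rng.normal(size=(8, 64))
video_embs = shared + 0.05 * rng.normal(size=(8, 64))
audio_embs = shared + 0.05 * rng.normal(size=(8, 64))
print("alignment loss:", contrastive_alignment_loss(video_embs, audio_embs))
```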

How did they test it?

The researchers tested the model on cross-modal retrieval tasks with the following three datasets: 

  • a video-text dataset with text captions, 
  • a video-audio dataset with audio captions, and 
  • an image-audio dataset with audio captions.

For example, in the video-audio dataset, the model used its 1,000 words to describe the actions happening in the videos. Then, when the researchers gave it audio queries, the model tried to retrieve the clip that best matched the spoken words. Their method not only found better matches than the models it was compared against, it was also easier to interpret.
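As an illustration of how such a retrieval test might be scored (a generic sketch, not the paper's evaluation code), one can rank all video embeddings by similarity to each audio query and measure how often the correct clip appears in the top K results; the names and data below are placeholders.

```python
import numpy as np

def recall_at_k(query_embs, target_embs, k=5):
    """Fraction of queries whose true match (same index) is ranked in the top k.

    query_embs:  (n, dim) e.g. audio caption embeddings.
    target_embs: (n, dim) e.g. video clip embeddings; query i should retrieve target i.
    """
    q = query_embs / (np.linalg.norm(query_embs, axis=1, keepdims=True) + 1e-8)
    t = target_embs / (np.linalg.norm(target_embs, axis=1, keepdims=True) + 1e-8)
    sims = q @ t.T                       # cosine similarities
    ranks = np.argsort(-sims, axis=1)    # best match first
    hits = [i in ranks[i, :k] for i in range(len(q))]
    return float(np.mean(hits))

# Toy usage: 100 paired audio/video embeddings in a 64-dim shared space.
rng = np.random.default_rng(4)
shared = rng.normal(size=(100, 64))
audio = shared + 0.1 * rng.normal(size=(100, 64))
video = shared + 0.1 * rng.normal(size=(100, 64))
print("recall@5:", recall_at_k(audio, video, k=5))
```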

Conclusion

According to the researchers, this could make the model easier to use in real-world settings, where understanding how it makes decisions is vital. Nonetheless, the model has some weaknesses that they plan to address. For one, their research considered data from only two modalities at a time, whereas people in the real world deal with many data modalities simultaneously. In addition, the images and videos in their datasets depicted simple objects or actions; real-world data is much messier. They also want to determine how well their approach performs on more varied inputs.
