MIT researchers have developed an AI technique that enables machine-learning models to pinpoint specific actions within lengthy videos. By locating the desired action efficiently, the method could streamline virtual training and help clinicians review diagnostic videos more quickly and accurately.
The internet is full of instructional videos that can teach viewers almost anything, from cooking the perfect pancake to performing a life-saving Heimlich maneuver. But pinpointing when and where a particular action happens in a long video can be tedious.
Ideally, a user could simply describe the desired action, and an AI model would skip to the corresponding segment of the video. However, training machine-learning models to do this requires large amounts of costly video data that humans have painstakingly annotated.
Researchers at MIT and the MIT-IBM Watson AI Lab have developed a more efficient way to train a model for this task, known as spatio-temporal grounding, using only videos and their automatically generated transcripts. They teach the model to understand an unlabeled video in two distinct ways: by examining fine details to determine where objects are located (spatial information) and by looking at the bigger picture to understand when actions occur (temporal information).
Compared with other AI approaches, their method more accurately identifies actions in longer videos that contain multiple activities. Interestingly, the researchers found that training on spatial and temporal information simultaneously improves the model's ability to identify each individually. Beyond its accuracy gains, the approach could be useful in healthcare settings, for instance by rapidly finding key moments in videos of diagnostic procedures.
Researchers typically teach models spatio-temporal grounding using videos in which humans have annotated the start and end times of particular tasks. Generating this data is expensive, and it can be difficult for humans to decide exactly what to label. If the action is "cooking a pancake," does it begin when the chef starts mixing the ingredients, or when she pours the batter into the pan?
For their approach, the researchers use unlabeled instructional videos and their accompanying text transcripts, taken from a platform such as YouTube, as training data. These need no special preparation. The training process is split into two parts. First, they teach the machine-learning model to look at the entire video to understand which actions happen at which moments. This high-level information is called a global representation.
Second, they teach the model to focus on the specific region of the video where the action occurs. In a large kitchen, for instance, the model might focus only on the wooden spoon a chef uses to mix pancake batter, rather than on the entire counter. This fine-grained information is called a local representation.
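To make the two-part idea concrete, the sketch below shows one plausible way to pair a global (clip-level) objective with a local (region-level) objective when learning from videos and their narrated transcripts. It is a minimal illustration, not the authors' architecture: the function names, tensor shapes, mean pooling, and attention pooling are all assumptions made for the example.

```python
# Minimal sketch (not the published method): combining a "global" clip-level
# objective with a "local" region-level objective for video-narration pairs.
# All shapes and pooling choices below are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (video, narration) pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def grounding_losses(frame_feats, region_feats, text_emb):
    """
    frame_feats:  (B, T, D)    per-frame features from an untrimmed clip
    region_feats: (B, T, R, D) per-region (spatial patch) features
    text_emb:     (B, D)       embedding of the narrated sentence
    """
    # Global branch: pool over time so the model learns *when* in the long
    # video the narrated action occurs (a simple mean is used here).
    global_emb = frame_feats.mean(dim=1)                        # (B, D)
    loss_global = info_nce(global_emb, text_emb)

    # Local branch: attend over spatial regions so the model learns *where*
    # the action happens (e.g., the spoon mixing batter, not the whole counter).
    regions = region_feats.flatten(1, 2)                        # (B, T*R, D)
    attn = torch.softmax(
        torch.einsum('bnd,bd->bn', regions, text_emb), dim=1)   # (B, T*R)
    local_emb = torch.einsum('bn,bnd->bd', attn, regions)       # (B, D)
    loss_local = info_nce(local_emb, text_emb)

    # Training on both objectives together is what the article reports
    # improves each ability individually.
    return loss_global + loss_local
```

In this kind of setup, the two losses are simply summed during training; the key point from the article is that the spatial and temporal signals reinforce one another rather than being learned separately.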
The researchers also incorporate an additional component into their framework to correct for misalignments between the narration and the video; for example, a chef might describe how to cook a pancake before actually doing it. To develop a more realistic solution, the researchers focused on uncut videos that are several minutes long. By contrast, most AI techniques are trained on short clips that have been trimmed to show a single action.
Their method also focuses more effectively on human-object interactions. If the action is "serving a pancake," for example, many other systems might attend only to key objects, such as a stack of pancakes sitting on a counter. Their approach instead focuses on the precise moment the chef flips a pancake onto a plate.
In the future, the researchers plan to enhance their approach so the model can automatically detect when the narration and the video are out of alignment and shift its focus from one modality to the other. They also want to extend their framework to audio data, since actions and the sounds objects make are often strongly correlated.
Research article: What, when, and where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Image source: COPILOT