Google’s AI research lab DeepMind has developed an AI-powered technology called video-to-audio (V2A) that can generate audio and dialogue for videos. According to the researchers, video generation models are advancing at an incredible pace, but many current systems can only produce silent output. Creating soundtracks for these silent videos is one of the next significant steps toward bringing generated movies to life, and this critical step is expected to enable a complete audiovisual experience created with AI.

The video-to-audio (V2A) technology 

This video-to-audio (V2A) technology works well with videos generated by AI models such as Google’s Veo. It works by combining video content with text prompts. Users can also supply additional details to guide the V2A system towards the sounds they want for a video, giving them creative control over the generated soundtrack.

In an official statement, the company said, “Today, we’re sharing progress on our video-to-audio (V2A) technology, which makes synchronized audiovisual generation possible. V2A combines video pixels with natural language text prompts to generate rich soundscapes for the on-screen action.” 

“Our V2A technology is pairable with video generation models like Veo to create shots with a dramatic score, realistic sound effects or dialogue that matches the characters and tone of a video”, they added. 

It is also capable of generating soundtracks for a range of traditional footage, including archival material, silent films and more, which opens broader creative opportunities. 

Robust control over audio output

It is important to note that V2A can generate unlimited soundtracks for any video input. Optionally, a ‘positive prompt’ can be defined to guide the generated output toward desired sounds or a ‘negative prompt’ to guide it away from undesired sounds. 

This flexibility gives users more control over V2A’s audio output, allowing them to rapidly experiment with different audio outputs and choose the best match. 
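
V2A has not been released publicly, so there is no real API to call. The snippet below is only a minimal sketch of how a prompt-guided request could be organised, to make the positive/negative prompt idea concrete; every name in it (SoundtrackRequest and its fields) is a hypothetical stand-in, not DeepMind’s interface.

```python
# Hypothetical sketch only: V2A has no public API, so SoundtrackRequest and
# its fields are invented here purely to illustrate prompt-guided generation.
from dataclasses import dataclass


@dataclass
class SoundtrackRequest:
    video_path: str            # input clip, e.g. a Veo-generated video
    positive_prompt: str = ""  # sounds to steer the output toward
    negative_prompt: str = ""  # sounds to steer the output away from
    num_candidates: int = 3    # V2A can produce many soundtracks per video


# Generate several candidate soundtracks for one clip, then pick the best match.
request = SoundtrackRequest(
    video_path="veo_clip.mp4",
    positive_prompt="dramatic orchestral score, distant thunder",
    negative_prompt="spoken dialogue, crowd noise",
)
print(request)
```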

How does V2A work? 

The researchers stated that they experimented with autoregressive and diffusion approaches to discover the most scalable AI architecture, and that the diffusion-based approach to audio generation gave the most realistic and compelling results for synchronizing video and audio information. 

The V2A system starts by encoding the video input into a compressed representation. A diffusion model then iteratively refines the audio from random noise, guided by the visual input and any natural language prompts, to generate synchronized, realistic audio that closely aligns with the prompt. Finally, the audio output is decoded, turned into an audio waveform and combined with the video data. 
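
To make that flow concrete, here is a structural sketch in Python. It is not DeepMind’s code: the encoder, diffusion model and decoder are learned networks that have not been released, so the functions below are simple stand-ins that only mirror the encode, iteratively denoise, and decode order of operations.

```python
# Structural sketch of the pipeline described above, not DeepMind's implementation.
# encode_video, denoise_step and decode_audio are stand-ins for learned networks.
import numpy as np


def encode_video(frames: np.ndarray) -> np.ndarray:
    """Stand-in video encoder: compress the frames into a conditioning vector."""
    return frames.reshape(frames.shape[0], -1).mean(axis=1)


def denoise_step(audio_latent: np.ndarray, video_code: np.ndarray,
                 prompt_embedding: np.ndarray) -> np.ndarray:
    """Stand-in diffusion step: nudge the noisy latent toward the conditioning."""
    guidance = 0.1 * (video_code.mean() + prompt_embedding.mean())
    return 0.9 * audio_latent + guidance


def decode_audio(audio_latent: np.ndarray) -> np.ndarray:
    """Stand-in decoder: map the refined latent to an audio waveform."""
    return np.tanh(audio_latent)


# 1. Encode the video into a compressed representation.
frames = np.random.rand(48, 64, 64, 3)     # dummy clip: 48 RGB frames
video_code = encode_video(frames)

# 2. Start from random noise and iteratively refine the audio latent,
#    guided by the video code and a (dummy) text-prompt embedding.
prompt_embedding = np.random.rand(128)
audio_latent = np.random.randn(16_000)
for _ in range(50):
    audio_latent = denoise_step(audio_latent, video_code, prompt_embedding)

# 3. Decode into a waveform; muxing it back with the video is left to a
#    standard tool such as ffmpeg.
waveform = decode_audio(audio_latent)
print(waveform.shape)
```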

To generate higher-quality audio and add the capability to guide the model towards generating specific sounds, the researchers added more information to the training process, including AI-generated annotations with detailed sound descriptions and transcripts of spoken dialogue. 

By training on video, audio and the additional annotations, the V2A technology learns to associate specific audio events with various visual scenes while responding to the information provided in the annotations or transcripts. 
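
As an illustration of what one such training example might contain, the record below pairs a clip with its soundtrack, AI-generated sound descriptions and a dialogue transcript. This is only a guess at a plausible data layout; the field names and example values are assumptions, not DeepMind’s dataset schema.

```python
# Illustrative only: a plausible layout for a single training example, with
# invented field names and values; not DeepMind's actual dataset format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class V2ATrainingExample:
    video_path: str    # the video clip
    audio_path: str    # its ground-truth soundtrack
    sound_annotations: List[str] = field(default_factory=list)  # AI-generated sound descriptions
    dialogue_transcript: str = ""    # transcript of spoken dialogue, if any


example = V2ATrainingExample(
    video_path="clips/street_scene_0001.mp4",
    audio_path="clips/street_scene_0001.wav",
    sound_annotations=["cars passing on a wet road", "light rain on pavement"],
    dialogue_transcript="",
)
```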

Sources of Article

Source: DeepMind

Image: Unsplash
