Meta has unveiled its own media-focused AI model, Movie Gen, which can generate realistic video and audio clips. 

To showcase Movie Gen’s capabilities, the company released several 10-second clips, including one of a Moo Deng-esque baby hippo paddling around. The announcement follows Meta’s Connect event, which featured the latest version of its large language model, Llama 3.2, along with new and updated hardware, although Movie Gen itself is not yet available for public use.

In addition to producing straightforward text-to-video clips, the Movie Gen model can make targeted adjustments to an existing clip, such as altering a surface’s appearance or adding an object to someone’s hands. In one of Meta’s example videos, a woman wearing a VR headset was edited so that she appeared to be wearing steampunk binoculars.

Meta describes Movie Gen as its most advanced and immersive storytelling suite of models, comprising four capabilities: video generation, personalised video generation, precise video editing, and audio generation. Meta has trained these models on a combination of licensed and publicly available datasets.

Each of these capabilities is described below.

Video generation 

Given a text prompt, Movie Gen uses a joint model optimised for both text-to-image and text-to-video to create high-quality, high-definition images and videos. According to Meta, this 30B-parameter transformer model can generate videos of up to 16 seconds at 16 frames per second. “We find that these models can reason about object motion, subject-object interactions, and camera motion, and they can learn plausible motions for a wide variety of concepts—making them state-of-the-art models in their category,” said Meta in its blog post.
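
As a rough sanity check on those figures, a 16-second clip at 16 frames per second works out to 256 frames. The short Python sketch below does the arithmetic; the 1080p resolution and 8-bit RGB format are assumptions made purely for illustration, since Meta’s post only describes the output as high-definition.

```python
# Back-of-the-envelope arithmetic on Meta's stated limits: 16 s at 16 fps.
# The 1080p resolution and 3 colour channels are assumptions for illustration;
# Meta's blog post only describes the output as "high-definition".
duration_s = 16
fps = 16
frames = duration_s * fps                      # 256 frames per clip
width, height, channels = 1920, 1080, 3
bytes_per_frame = width * height * channels    # uncompressed 8-bit RGB
raw_bytes = frames * bytes_per_frame
print(f"{frames} frames, ~{raw_bytes / 1e9:.1f} GB of raw 8-bit RGB pixels")
```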

Personalised video generation 

Meta also expanded the above foundation model to support personalised video generation. The model takes a person’s image as input and combines it with a text prompt to generate a video that contains the reference person along with rich visual details informed by the prompt. “Our model achieves state-of-the-art results when it comes to creating personalised videos that preserve human identity and motion,” Meta claimed.

Precise video editing 

The editing variant of the same foundation model takes both video and text prompts as input, executing tasks with precision to generate the desired output. It combines video generation with advanced image editing, performing localised edits like adding, removing, or replacing elements and global changes such as background or style modifications. “Unlike traditional tools that require specialised skills or generative ones that lack precision, Movie Gen preserves the original content, targeting only the relevant pixels,” the blog post said. 
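
Meta does not detail how this pixel-level targeting works, but the general idea of a localised edit can be illustrated with a simple mask-based blend, in which generated content replaces only the masked pixels and everything else is copied from the original clip. The NumPy sketch below is a minimal illustration of that idea, not Meta’s actual pipeline.

```python
import numpy as np

# Illustrative sketch of localised editing, not Meta's pipeline:
# "targeting only the relevant pixels" can be expressed as blending a
# generated edit back into the original video through a mask.
def apply_localised_edit(original, edited, mask):
    """original, edited: (frames, H, W, 3) arrays; mask: (frames, H, W, 1) in [0, 1].

    Pixels where mask == 0 are copied from the original clip untouched,
    so the edit only affects the masked region.
    """
    return mask * edited + (1.0 - mask) * original

# Toy usage with random data standing in for real video frames.
frames, H, W = 4, 64, 64
original = np.random.rand(frames, H, W, 3)
edited = np.random.rand(frames, H, W, 3)
mask = np.zeros((frames, H, W, 1))
mask[:, 16:48, 16:48, :] = 1.0          # edit only a central square
result = apply_localised_edit(original, edited, mask)
```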

Audio generation 

Meta has also trained a 13B-parameter audio generation model that takes a video and optional text prompts and generates up to 45 seconds of high-quality, high-fidelity audio, including ambient sound, sound effects (Foley), and instrumental background music, all synced to the video content. The company has also introduced an audio extension technique that produces coherent audio for videos of arbitrary length, and claims state-of-the-art performance in audio quality, video-to-audio alignment, and text-to-audio alignment.
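
Meta does not explain the internals of this audio extension technique. One generic way to go beyond a fixed 45-second cap is to generate overlapping windows and crossfade them together; the Python sketch below illustrates that approach with a stand-in generator and should be read as an assumption, not as Meta’s method.

```python
import numpy as np

# Illustration only: Meta does not describe how its audio extension works.
# This shows one generic way to stitch fixed-length audio windows
# (here, 45-second chunks) into a longer track with a short crossfade.
SAMPLE_RATE = 48_000
CHUNK_S = 45
OVERLAP_S = 1

def fake_audio_model(n_samples: int) -> np.ndarray:
    """Stand-in for a generator capped at a fixed clip length."""
    return np.random.randn(n_samples).astype(np.float32)

def extend_audio(total_seconds: float) -> np.ndarray:
    chunk = CHUNK_S * SAMPLE_RATE
    overlap = OVERLAP_S * SAMPLE_RATE
    fade = np.linspace(0.0, 1.0, overlap, dtype=np.float32)
    out = fake_audio_model(chunk)
    while len(out) < total_seconds * SAMPLE_RATE:
        nxt = fake_audio_model(chunk)
        # Crossfade the overlapping region so chunk boundaries stay coherent.
        out[-overlap:] = out[-overlap:] * (1.0 - fade) + nxt[:overlap] * fade
        out = np.concatenate([out, nxt[overlap:]])
    return out[: int(total_seconds * SAMPLE_RATE)]

two_minutes = extend_audio(120)  # e.g. score a 2-minute video
```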
