A text-to-video machine learning model accepts a natural language description as input and generates a video that matches it.

In March 2023, Alibaba Research published a critical article that used many ideas identified in latent image diffusion models in video production. Although different approaches exist, comprehensive latent diffusion models are currently considered the state-of-the-art in video diffusion. While the outputs may be unsatisfactory, numerous models demonstrate high control and the capacity to make film in various artistic styles.

Here are the five attractive text-to-video AI models you can try in 2024.

CogVideo

A group of researchers from Beijing's University of Tsinghua developed a large-scale pre-trained text-to-video generative model called CogVideo. The team constructed the model using the CogView2 text-to-image model to use its pre-training information. 

CogVideo proposes a set of models capable of producing videos from a given description. The system employs two models to make frames hierarchically: the first generates the "keyframes," while the second interpolates between the generated keyframes.

Phenaki

The Phenaki Video team used Mask GIT to create text-guided films in PyTorch. The model generates videos directed by text that can last up to two minutes.

Instead of accepting the anticipated probabilities, the study proposes a tweak: bring in an extra critic to decide what to mask during repeated sampling. It will help you choose which elements to emphasize during the video-making process—it's like getting a second opinion.

The adaptable model allows researchers to train on text-to-image and text-to-video. They can begin with graphics and then fine-tune using video for unconditional training.

VideoPoet

VideoPoet is a large language model trained on many videos, images, audio, and text. This model can perform a wide range of video-generating tasks, such as converting text or images into movies, enhancing films with different styles, filling in missing parts of videos, extending the content of videos beyond their original frames, and extracting audio from videos.

The model is constructed based on a simple concept: transform any autoregressive language model into a system that generates videos. Autoregressive language models can create text and code at a high rate. However, they encounter an obstacle when it comes to video. To address this issue, VideoPoet utilizes various tokenizers capable of converting video, image, and audio segments into a format it can comprehend.

Lumiere

Google has developed an artificial intelligence system called Lumiere, which utilizes a novel diffusion model called Space-Time-U-Net, abbreviated as STUNet. As reported by Ars Technica, Lumiere employs a method that does not involve the process of combining individual frames. Instead, it identifies the spatial location of objects inside a video and simultaneously analyzes their movements and changes over time.

Lumiere is not yet accessible for ordinary individuals to replicate. However, it suggests that Google has the talent to create a highly advanced AI video platform that could surpass the already accessible models such as Runway and Pika. Google has significantly advanced AI for video games within two years.

Sora

OpenAI recently demonstrated Sora, their latest text-to-video model. The model's "profound comprehension of language" and ability to create "captivating characters that convey vivid emotions" have generated excitement among everyone. Individuals on social media express extreme excitement and astonishment at the high level of realism displayed in the films, deeming it a revolutionary development.

Before its public release, the AI startup implements precautionary measures to ensure safety. Additionally, they acknowledge that Sora experiences specific difficulties, such as maintaining fluidity and distinguishing between left and right.

Sources of Article

Image source: Unsplash

Want to publish your content?

Publish an article and share your insights to the world.

Get Published Icon
ALSO EXPLORE