We now have an artificial intelligence system that can make photos sing, demonstrating how far AI has come. A program developed by a group of Microsoft Research Asia researchers can animate a still photo of a person using nothing more than an audio file. It is not just animation, either: according to reports, the output accurately portrays the subjects of the pictures singing or speaking, complete with appropriate facial expressions.

In addition to creating lip movements that are flawlessly timed with the audio, this new image-to-video model, named VASA-1, captures a wide range of natural head gestures and facial expressions that add to the impression of authenticity and liveliness.

What does the model do? 

A single portrait shot and a speech audio track are all the model needs to generate a lifelike, audio-driven talking face in real time. Microsoft claims the result is “hyper-realistic”, with accurately synchronized lip and head motions and a wide variety of expressive facial characteristics.

According to Microsoft, “it can handle arbitrary-length audio and steadily output seamless talking face videos.” 

What are its capabilities? 

Interestingly, the model can handle kinds of images and audio that were not included in its training data, such as artistic images, non-English speech, and singing. As an example, Microsoft offered a video clip of Leonardo da Vinci’s Mona Lisa singing a rap song.

One of the main innovations is a holistic model for generating head movements and facial dynamics that operates in a face latent space. Another is the construction of an expressive and disentangled face latent space learned from videos. After extensive experiments and evaluation on a set of new metrics, Microsoft found that their method substantially outperforms previous approaches across a wide range of dimensions.
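To make these two ideas a little more concrete, here is a minimal, purely illustrative sketch in PyTorch of how a system with those roles could be organized: an encoder that extracts a static appearance latent from the portrait, a generator that turns audio features into a sequence of motion latents (head pose plus facial dynamics), and a decoder that renders frames from the two. Every class name, dimension, and the simple recurrent generator below are assumptions made for illustration; they do not describe Microsoft’s actual VASA-1 architecture.

```python
# Illustrative only: a toy pipeline with the three roles described above.
# All names, shapes, and the GRU-based generator are placeholders, not VASA-1.

import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Maps a portrait image to a static appearance/identity latent."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, image):                 # image: (B, 3, H, W)
        return self.net(image)                # -> (B, latent_dim)

class MotionGenerator(nn.Module):
    """Turns audio features into per-frame motion latents (head pose plus
    facial dynamics), conditioned on the appearance latent."""
    def __init__(self, audio_dim=80, latent_dim=128, motion_dim=64):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + latent_dim, 256, batch_first=True)
        self.head = nn.Linear(256, motion_dim)

    def forward(self, audio_feats, appearance):
        # audio_feats: (B, T, audio_dim); appearance: (B, latent_dim)
        cond = appearance.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        hidden, _ = self.rnn(torch.cat([audio_feats, cond], dim=-1))
        return self.head(hidden)              # -> (B, T, motion_dim)

class FrameDecoder(nn.Module):
    """Renders one frame per motion latent, reusing the appearance latent."""
    def __init__(self, latent_dim=128, motion_dim=64, out_hw=64):
        super().__init__()
        self.out_hw = out_hw
        self.net = nn.Linear(latent_dim + motion_dim, 3 * out_hw * out_hw)

    def forward(self, appearance, motion):    # motion: (B, T, motion_dim)
        cond = appearance.unsqueeze(1).expand(-1, motion.size(1), -1)
        frames = self.net(torch.cat([cond, motion], dim=-1))
        return frames.view(*motion.shape[:2], 3, self.out_hw, self.out_hw)

# Toy end-to-end pass: one portrait plus roughly two seconds of audio features.
portrait = torch.randn(1, 3, 64, 64)
audio = torch.randn(1, 50, 80)                # 50 audio frames, 80 mel bins (assumed)
encoder, generator, decoder = FaceEncoder(), MotionGenerator(), FrameDecoder()

appearance = encoder(portrait)                # identity/appearance, computed once
motion = generator(audio, appearance)         # time-varying motion latents
video = decoder(appearance, motion)
print(video.shape)                            # torch.Size([1, 50, 3, 64, 64])
```

The disentanglement is visible in the interfaces: identity and appearance live in one latent computed once from the portrait, while everything that changes over time, such as lips, expression and head pose, is carried by the per-frame motion latents.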

In addition to delivering high video quality with realistic facial and head motion, Microsoft’s approach supports the online generation of 512x512 videos at up to 40 frames per second with negligible starting latency. It opens the door to real-time interactions with realistic avatars that mimic human speech patterns.
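The “online” part of that claim essentially means frames are produced chunk by chunk as audio arrives rather than after the whole track has been processed, which is why the first frames can appear with very little delay. The snippet below is a small, hypothetical sketch of that streaming pattern; the chunk length, the stand-in generate_chunk function, and the pacing logic are assumptions, not details taken from the paper.

```python
import time

def stream_talking_face(audio_chunks, generate_chunk, target_fps=40):
    """Yield video frames chunk by chunk, pacing output to the target frame rate.

    audio_chunks   - an iterable of short audio segments arriving over time
    generate_chunk - any callable mapping one audio segment to a list of frames
    """
    frame_period = 1.0 / target_fps
    next_deadline = time.monotonic()
    for chunk in audio_chunks:
        for frame in generate_chunk(chunk):      # model inference per chunk
            yield frame                          # frame is available immediately
            next_deadline += frame_period
            # Sleep only if we are running ahead of real-time playback.
            time.sleep(max(0.0, next_deadline - time.monotonic()))

# Toy usage with stand-in components: four 0.5-second chunks, 20 frames each.
fake_chunks = [[0.0] * 8000 for _ in range(4)]            # 16 kHz audio assumed
fake_generate = lambda chunk: [f"frame-{i}" for i in range(20)]

for frame in stream_talking_face(fake_chunks, fake_generate):
    pass  # in practice: display or encode the frame here
```

The practical consequence is that startup latency is bounded by a single audio chunk rather than by the length of the clip, which is what makes live, conversational use of such avatars plausible.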

Associated risks 

Microsoft stated that, with an eye toward beneficial applications, the research focuses on creating visual affective skills for virtual AI avatars; it is not meant to produce content intended to trick or mislead.

Like other similar content-generation techniques, however, it could still be misused to impersonate someone. “We are interested in applying our technique for advancing forgery detection and are opposed to any behaviour to create misleading or harmful contents of real persons,” the researchers added.

Microsoft stated, “Currently, the videos generated by this method still contain identifiable artefacts, and the numerical analysis shows that there’s still a gap to achieve the authenticity of real videos.”  

While acknowledging the possibility of misuse, Microsoft argued that it is crucial to recognize the technique’s significant positive potential. Benefits such as increased educational equity, improved accessibility for people with speech difficulties, and the ability to offer therapy or companionship to those in need highlight the significance of the study and related investigations.

Will it be available to the public? 

Although the research article does not mention a release date, it does suggest that VASA-1 is, for now, a research prototype that moves us closer to a future in which AI avatars can interact naturally. The researchers have recognized that VASA-1 could be misused despite its many applications, and they have proactively decided not to make it available to the public.

They recognize that such cutting-edge technology must be stewarded carefully to prevent unforeseen effects or misuse. The researchers have stated that although the animations have a lifelike charm and skillfully blend audio and visuals, closer inspection may reveal small defects and the telltale signs typical of AI-generated content. Even so, the examples demonstrate the team’s technical prowess in building VASA-1.

 

Sources of Article

Microsoft Research
