Nvidia's Artificial Intelligence (AI) researchers have unveiled an AI that can generate talking heads from a single 2D image for video conferencing. According to the team, the system is capable of a wide range of video manipulation, from rotating a person's head and moving it side to side, to motion transfer and video reconstruction.
While the AI takes only a 2D photo as input, it then extracts 3D keypoints, without supervision, to create a video. Not only does the approach outperform existing methods on benchmark datasets, it also achieves H.264-quality video at one-tenth the bandwidth previously required. "Our model learns to synthesize a talking-head video using a source image containing the target person's appearance and a driving video that dictates the motion in the output. Our motion is encoded based on a novel keypoint representation, where the identity-specific and motion-related information is decomposed unsupervised," states Nvidia's GitHub page, where the research and the code are shared. Nvidia research scientists Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu published the research paper on arXiv.org on Monday.
The paper shows how the latest AI model outperforms models such as FOMM, few-shot vid2vid, and bi-layer neural avatars (bilayer). “By modifying the keypoint transformation only, we are able to generate free-view videos. By transmitting just the keypoint transformations, we can achieve much better compression ratios than existing methods,” the paper reads. This means that, with reduced bandwidth usage, users will experience fewer of the breakups, jitters, and freezes that come with the heavy bandwidth demands of video conferencing apps.
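To get a feel for why transmitting keypoint transformations instead of pixels saves so much bandwidth, consider a back-of-envelope comparison. The figures below (20 keypoints, 32-bit floats, a 2 Mbit/s H.264 stream at 30 fps) are illustrative assumptions for the sketch, not numbers from the paper:

```python
# Back-of-envelope comparison: per-frame payload of 3D keypoints
# vs. a conventionally compressed video frame.
# All figures here are illustrative assumptions, not from the paper.

NUM_KEYPOINTS = 20          # assumed keypoint count
FLOATS_PER_KEYPOINT = 3     # x, y, z coordinates
BYTES_PER_FLOAT = 4         # 32-bit floats

# Bytes per frame when sending only keypoint coordinates.
keypoint_payload = NUM_KEYPOINTS * FLOATS_PER_KEYPOINT * BYTES_PER_FLOAT  # 240 bytes

# Bytes per frame for an assumed 2 Mbit/s H.264 stream at 30 fps.
h264_frame_bytes = (2_000_000 / 8) / 30

ratio = h264_frame_bytes / keypoint_payload
print(f"keypoints: {keypoint_payload} B/frame, "
      f"H.264: {h264_frame_bytes:.0f} B/frame, ~{ratio:.0f}x smaller")
```

Even with generous padding for appearance features sent once up front, the per-frame payload stays orders of magnitude below a pixel stream, which is where the reported compression gains come from.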
Currently, video-calling apps transmit huge amounts of pixel-rich imagery as compressed video signals from the sender's internet connection to the receiver's, which causes the various glitches in video conferencing and calling. Nvidia's proposed system, by contrast, sends only selected data, a few appearance features and 3D canonical keypoints such as the caller's eyes, nose, and mouth, and recreates a synthetic 'face' on the receiver's end. The added movements make the synthetic image more lifelike and visually convincing.
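On the wire, "sending only keypoints" could amount to little more than serializing a short list of coordinates each frame. The sketch below shows one hypothetical wire format; the function names and layout are illustrative, not from Nvidia's code:

```python
import struct

# Hypothetical wire format for the per-frame data a sender might
# transmit instead of pixels: a flat sequence of (x, y, z) floats.
# Illustrative only; not Nvidia's actual protocol.

def pack_keypoints(keypoints):
    """Serialize [(x, y, z), ...] into a compact little-endian payload."""
    flat = [coord for kp in keypoints for coord in kp]
    return struct.pack(f"<{len(flat)}f", *flat)

def unpack_keypoints(payload):
    """Inverse of pack_keypoints: bytes -> [(x, y, z), ...]."""
    flat = struct.unpack(f"<{len(payload) // 4}f", payload)
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

kps = [(0.5, -0.25, 1.0), (2.0, 0.0, -1.5)]
payload = pack_keypoints(kps)
print(len(payload))  # 24 bytes for 2 keypoints (3 floats x 4 bytes each)
```

The receiver would then feed the unpacked keypoints, together with appearance features sent once at call start, into the generator that synthesizes the face.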
This AI advancement could become a feature of Maxine, Nvidia's video conferencing service, which was released in October. The service has been reported to provide novel AI-enabled features such as live translation and conversational AI avatars, along with Zoom-like virtual backgrounds.
While the enterprise communication landscape is currently a battleground, with tech giants such as Microsoft, Zoom, and Salesforce locked in a constant race to ship new features, Nvidia's entry into the field could take the fight to a whole new level.
Nvidia is perhaps best known for its generative adversarial networks (GANs), models that can blur the line between the real and the digital.
“By dramatically reducing the bandwidth and ensuring a more immersive experience, we believe this is an important step toward the future of video conferencing,” reads the paper. The AI community has received the advancement very well. Ian Goodfellow, the renowned research scientist who pioneered generative adversarial networks (GANs), sent kudos to the team: “This is really cool. Some of my PhD labmates worked on ML for compression back in the pretraining era, and I remember it being really hard to get a compression advantage.”