Microsoft has introduced a new artificial intelligence (AI) model that can generate hyper-realistic videos of talking human faces. Dubbed VASA-1, the AI image-to-video model can generate videos from just one photo and a speech audio clip. The company says the created videos will have synchronised lip movements to match the audio as well as facial expressions and head movement to make it appear natural. Notably, the tech giant does not intend to release a product or API with the VASA-1 model and claims that it will be used to create realistic virtual characters.
In a post on its Research announcement page, Microsoft detailed the workings of its under-development AI model and highlighted its capabilities. The company claims that the VASA-1 model can generate videos at a resolution of 512 x 512 pixels and at up to 40 frames per second (FPS). The AI model is also said to support online video generation with negligible starting latency. X (formerly known as Twitter) user Kaio Ken shared a video of the AI model in action.
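To put those figures in perspective, the short Python sketch below works out what 512 x 512 pixels at 40 FPS implies for the time available per frame and for the amount of raw pixel data involved. The resolution and frame rate come from Microsoft's claims; the one-minute clip length and the 3 bytes per pixel (uncompressed 8-bit RGB) are assumptions made purely for illustration and say nothing about how VASA-1 stores or encodes its output.

```python
# Figures reported by Microsoft for VASA-1
WIDTH, HEIGHT = 512, 512   # output resolution in pixels
MAX_FPS = 40               # maximum frames per second

# Assumptions for illustration only (not VASA-1 specifications)
CLIP_SECONDS = 60          # an example one-minute clip
BYTES_PER_PIXEL = 3        # uncompressed 8-bit RGB


def frame_budget_ms(fps):
    """Time available to produce each frame, in milliseconds."""
    return 1000.0 / fps


def frame_count(fps, seconds):
    """Total number of frames in a clip of the given length."""
    return fps * seconds


def raw_clip_bytes(width, height, fps, seconds, bytes_per_pixel):
    """Size of the clip if every frame were stored uncompressed."""
    return width * height * bytes_per_pixel * frame_count(fps, seconds)


if __name__ == "__main__":
    total = raw_clip_bytes(WIDTH, HEIGHT, MAX_FPS, CLIP_SECONDS, BYTES_PER_PIXEL)
    print(f"Per-frame budget: {frame_budget_ms(MAX_FPS):.1f} ms")
    print(f"Frames per clip:  {frame_count(MAX_FPS, CLIP_SECONDS)}")
    print(f"Raw clip size:    {total / 2**30:.2f} GiB")
```

Under these assumptions, each frame has to be ready within 25 milliseconds, and a one-minute clip comes to 2,400 frames, or roughly 1.76 GiB of pixel data before any compression.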
While the biggest achievement of VASA-1 is rendering high-quality videos up to one minute long (as per the demos) from a single static image, the company also highlighted its ability to generate lip movements that match the audio file, along with facial expressions to go with them. The AI video generation model also gives the user granular control over different aspects of the video, such as main eye gaze direction, head distance, emotion offsets, and more. These attribute controls over disentangled appearance, 3D head pose, and facial dynamics can help tailor the output closely to the user’s directions.
Further, the AI model was also able to generate videos using artistic photos, singing audio, and non-English speech. Microsoft researchers point out that no such inputs were present in its training data, hinting at an ability to generalise beyond what it was trained on.
The AI model’s ability to generate hyper-realistic videos of real people speaking any audio is impressive, but it also raises concerns about potential misuse, especially for creating deepfakes. The company highlighted that it does not intend to release the AI model to the public and wants to use it to create interactive virtual characters.
Microsoft also said that this technique can be used for advancing forgery detection. “While acknowledging the possibility of misuse, it’s imperative to recognize the substantial positive potential of our technique. The benefits – ranging from enhancing educational equity, improving accessibility for individuals with communication challenges, and offering companionship or therapeutic support to those in need – underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being,” the company added.