Researchers At Google Unveil ‘VLOGGER’, An AI Capable Of Turning Photos Into Videos


For years, the AI industry was relatively quiet, barely making ripples that disrupted other industries.

But since OpenAI introduced ChatGPT, the company has driven a generative AI trend that seemingly everyone wants to join. Pretty much every tech giant is now developing or experimenting with the technology to see how it can benefit them.

Generative AI has moved from mere text generators to image generators, and now it can even create videos from text inputs alone.

Following OpenAI's Sora, and China's quick answer in the form of Alibaba's EMO, this time it's Google's turn.

Researchers working for the tech titan have developed a new artificial intelligence system that can generate lifelike videos of people speaking, gesturing and moving - from just a single still photo.

Called 'VLOGGER', the system relies on advanced machine learning models to synthesize realistic footage, opening up a range of potential applications.

Described in a research paper (PDF) titled VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis, the researchers explained that the AI model can take a photo of a person and an audio clip as input, and then output a video that matches the audio.

The result can show the person speaking the words and making corresponding facial expressions, head movements and even hand gestures.
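VLOGGER itself has not been publicly released, so there is no API to call. The toy sketch below only illustrates the input/output shape of the idea described above: one photo plus an audio clip in, a sequence of video frames out, with each frame driven by the corresponding slice of audio. The function name, the loudness feature, and the brightness modulation are all invented stand-ins for the learned motion model, not anything from the paper.

```python
import numpy as np

def animate_photo(photo, audio, fps=25, sample_rate=16000):
    """Toy sketch of the photo + audio -> video idea: one output frame
    per chunk of audio. A real model would predict facial motion and
    gestures from the audio; here we merely brighten the image in
    proportion to each chunk's loudness."""
    samples_per_frame = sample_rate // fps
    n_frames = len(audio) // samples_per_frame
    frames = []
    for i in range(n_frames):
        chunk = audio[i * samples_per_frame:(i + 1) * samples_per_frame]
        loudness = float(np.abs(chunk).mean())  # crude audio feature
        frame = np.clip(photo * (1.0 + loudness), 0.0, 1.0)
        frames.append(frame)
    return np.stack(frames)  # shape: (n_frames, H, W, 3)

# One second of audio at 16 kHz and a 64x64 "photo":
photo = np.random.rand(64, 64, 3)
audio = np.sin(np.linspace(0.0, 440 * 2 * np.pi, 16000))
video = animate_photo(photo, audio)
```

At 25 fps and 16 kHz, one second of audio yields 25 frames, each the same height and width as the input photo.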

This groundbreaking technology, which comes out of Google Research, could change the way users interact with avatars and multimedia content.

The biggest difference between VLOGGER and the models that came before it is that VLOGGER incorporates additional control mechanisms, which should take the concept of avatar creation to new heights.

With only a portrait photo and an audio clip, users can easily make the photo "move," essentially turning it into a video, with sound.

The researchers, led by Enric Corona at Google Research, use a type of machine learning model called diffusion models.

This approach isn't new: many others have also used diffusion models to build AIs with remarkable performance at generating highly realistic images from text descriptions.
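Diffusion models work by gradually adding noise to training data and learning to reverse that process. The minimal numerical sketch below shows the two directions with a toy linear noise schedule; in a real diffusion model, a trained network predicts the noise at each reverse step, whereas here the noise is simply known. This is a conceptual illustration, not VLOGGER's actual architecture.

```python
import numpy as np

def forward_noise(x0, noise, t, T):
    """Forward diffusion: blend a clean sample with Gaussian noise.
    At t=0 the sample is clean; at t=T it is pure noise."""
    alpha = 1.0 - t / T  # toy linear schedule
    return np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * noise

def reverse_step(x_t, predicted_noise, t, T):
    """One reverse step: subtract the predicted noise and rescale.
    In a trained model, a neural network supplies predicted_noise."""
    alpha = 1.0 - t / T
    return (x_t - np.sqrt(1.0 - alpha) * predicted_noise) / np.sqrt(alpha)

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))              # stand-in for a clean image
noise = rng.standard_normal((8, 8))
x_t = forward_noise(x0, noise, t=50, T=100)   # half signal, half noise
recovered = reverse_step(x_t, noise, t=50, T=100)
# With the exact noise supplied, the clean sample is recovered.
```

The whole game of training a diffusion model is making the network's noise prediction accurate enough that this recovery works from noise alone.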

But at Google, the researchers extended the approach by making the AI learn from videos, training it on a vast new dataset called MENTOR, which contains more than 800,000 diverse identities and 2,200 hours of video.

This dataset, an order of magnitude larger than what was previously available, allowed the team to create an AI system that can bring photos to life in a highly convincing way.

VLOGGER was able to learn to generate videos of people with varied ethnicities, ages, clothing, poses and surroundings, with little bias.

"In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate," the authors wrote.

At this time, VLOGGER still has limitations: the generated videos are relatively short and can only have a static background.

Furthermore, the animated subject cannot move around a 3D environment, and the person's mannerisms and speech patterns, while realistic, are not yet indistinguishable from those of real humans.

While the results are not perfect, and can still have artifacts here and there, VLOGGER represents a significant leap in the ability to animate still images.

"We evaluate VLOGGER on three different benchmarks and show that the proposed model surpasses other state-of-the-art methods in image quality, identity preservation and temporal consistency," the authors reported.

Even so, the technology opens up a range of compelling use cases.

The paper demonstrates VLOGGER’s ability to automatically dub videos into other languages by simply swapping out the audio track, to seamlessly edit and fill in missing frames in a video, and to create full videos of a person from a single photo.

One could imagine actors being able to license detailed 3D models of themselves that could be used to generate new performances. The technology could also be used to create photorealistic avatars for virtual reality and gaming.

It may also enable the development and the creation of AI-powered virtual assistants and chatbots that are more engaging and expressive.

Google sees VLOGGER as a step toward "embodied conversational agents" that can engage with humans naturally through speech, gestures and eye contact.

"VLOGGER can be used as a stand-alone solution for presentations, education, narration, low-bandwidth online communication, and as an interface for text-only human-computer interaction," the authors explained.

But again, just like previous groundbreaking AI tools that came before it, VLOGGER also raises concerns around deepfakes and misinformation.