The world is noisy. Computers, thanks to various sensors, know that. But generating sound is not as easy as it seems.
Generating audio for videos with large language models is hard because syncing sound perfectly with moving images is tricky, and LLMs aren’t naturally built to handle both audio and visuals together. The difference between continuous audio signals and discrete visual data makes integration complex. Plus, there aren’t enough good datasets pairing video and audio, which limits training quality.
Evaluating audio quality is tough since it’s subjective and lacks clear metrics.
On top of that, ethical risks like deepfakes and high computing costs make it a challenging problem that needs more research and careful use.
This is why historically, AI-generated videos have been silent, requiring creators to manually add audio tracks post-production. This process was time-consuming and often disrupted the creative flow.
'Veo 3' changes this paradigm by integrating audio directly into the video generation process.
Say goodbye to the silent era of video generation: Introducing Veo 3 — with native audio generation.
Quality is up from Veo 2, and now you can add dialogue between characters, sound effects and background noise.
Veo 3 is available now in the @GeminiApp for Google AI Ultra… pic.twitter.com/7rcXeBslyU— Google (@Google) May 20, 2025
There are lots of video generators out there, and OpenAI's Sora is one of the most talked-about tools among them.
But Veo 3 distinguishes itself from the crowd with its ability to incorporate audio into the videos it creates. The company said that this audio can include dialogue between characters as well as animal sounds.
"Veo 3 excels from text and image prompting to real-world physics and accurate lip syncing," Eli Collins, Google DeepMind product vice president, said in a blog post.
Unveiled at Google I/O 2025, Veo 3 represents a major leap beyond its predecessor, Veo 2, which primarily generated realistic visuals from text and image prompts.
Now equipped with native audio generation, Veo 3 lets users provide text descriptions of scenes, characters, and sounds. It then produces fully synchronized video clips complete with matching audio—dialogue, sound effects, and ambient noise—making content creation smoother, more immersive, and seamlessly integrated.
This innovation is made possible by DeepMind's advancements in "video-to-audio" AI, which enables the model to understand visual cues and generate appropriate audio.
Video, meet audio.
With Veo 3, our new state-of-the-art generative video model, you can add soundtracks to clips you make.
Create talking characters, include sound effects, and more while developing videos in a range of cinematic styles. pic.twitter.com/5Hfpetfg8b— Google DeepMind (@GoogleDeepMind) May 20, 2025
"For the first time, we’re emerging from the silent era of video generation," said Demis Hassabis, the CEO of Google DeepMind.
"[You can give Veo 3] a prompt describing characters and an environment, and suggest dialogue with a description of how you want it to sound."
Google Veo 3 marks a significant leap forward in AI video generation by addressing the longstanding challenge of integrating audio with visuals. This advancement not only enhances the realism and immersion of AI-generated content but also empowers creators to bring their visions to life more efficiently and effectively.
Additionally, Veo 3 incorporates DeepMind's SynthID watermarking technology, which embeds invisible markers into generated frames to help identify AI-generated content and combat misinformation.
From capturing real-world physics - like the noise and movement of water, or the look and sound of walking in snow - to lip syncing, Veo 3 is great at understanding what you want.
You can tell a short story in your prompt, and the model gives you back a clip that brings it to… pic.twitter.com/ePh3mnOQZt— Google DeepMind (@GoogleDeepMind) May 20, 2025
Lastly, Google unveiled 'Flow,' a new filmmaking tool that allows users to create cinematic videos by describing locations, shots and style preferences.
Accessible through Gemini, Whisk, Vertex AI and Workspace, Flow essentially combines "the best of our most advanced models Veo, Imagen and Gemini into one master filmmaking tool."
Get into the zone with Flow.
It combines the best of our most advanced models Veo, Imagen and Gemini into master filmmaking tool - helping you weave cinematic clips, dynamic scenes, and compelling narratives into stories with consistent results. pic.twitter.com/E2a7NJNuP9— Google DeepMind (@GoogleDeepMind) May 20, 2025
LLMs have come a long way — evolving from generating just text to creating stunning images, and now full-fledged videos. With each leap, the competition heats up, pushing boundaries faster than ever. The race to deliver richer, more immersive content is on, and every new breakthrough raises the bar higher for what AI can achieve next.
With Imagen 4, Veo 3 and Flow, Google aims to remain competitive as image and video generation become popular use cases for generative AI.
