'Gemini Omni' From Google Can 'Create Anything From Anything': Huge Gain But Not Quite There Yet

The AI landscape has been defined by a relentless war over large language models (LLMs).

Ever since OpenAI's ChatGPT debuted and reshaped expectations for what conversational intelligence could achieve, the battle among tech giants has shaken the entire tech sphere. Google, for its part, has astonished the world with Gemini.

The company, a powerhouse in search and foundational models, has also astonished the world with its Veo video generation models, particularly Veo 3 and Veo 3.1, which delivered photorealistic clips with sound and seemed poised to dominate the emerging field of generative media. Those early demonstrations highlighted Google's deep expertise in rendering realistic motion and environments, positioning the company as a frontrunner in turning text and images into coherent video sequences that captured everyday physics and visual detail with surprising fidelity.

Yet the competitive dynamics shifted rapidly as specialized tools like ByteDance's Seedance 2.0 and Alibaba's Happyhorse 1.0 emerged, pushing the boundaries of multimodal video creation with superior motion fidelity, audio synchronization, and creative control, and leading many observers to note that Google's earlier lead had been eclipsed.

Then comes 'Omni,' an "any-to-any" model that can create "anything from anything," and "everything from everything."

Omni brings together an improved understanding of physics with Gemini's knowledge of history, biology, and culture, bridging the gap from photorealism to meaningful storytelling.

Actions have consequences, environments respond to events, and narratives evolve logically. pic.twitter.com/ajQ3purg0g
— Google DeepMind (@GoogleDeepMind) May 19, 2026

Seedance 2.0 excelled in handling combined inputs of text, images, video clips, and audio to produce cinematic sequences with director-level precision over camera movements, lighting, and narrative flow, while Happyhorse 1.0 stood out for its top-ranked performance in blind human evaluations, generating dialogue-driven scenes with natural lip sync and immersive sound that felt remarkably lifelike.

These models raised the bar not just on output quality but on usability, allowing creators to iterate on complex, multi-shot stories without restarting from scratch, and they quickly became the benchmarks against which every new release was measured.

With the introduction of Gemini Omni, Google appears to be mounting a serious effort to catch up in this fast-evolving space, integrating the reasoning prowess of its Gemini models with advanced generative capabilities to enable more intuitive video editing and world-aware storytelling.

Thanks to Gemini, Omni includes features like conversational editing, which lets users reimagine a clip by describing changes to environments or objects while preserving the original intent.

Gemini Omni doesn't just build scenes that look real, it reasons about what should happen next. It combines an intuitive understanding of physics with Gemini's knowledge of history, science, and cultural context.

Rolling out today starting with video outputs to Google AI Plus,… pic.twitter.com/EkLjv5O0dN
— Sundar Pichai (@sundarpichai) May 19, 2026

Also thanks to Gemini, users can create their own avatars with Omni.

Create videos with your own voice and likeness using avatars with Gemini Omni.

When you create an avatar, you have an AI digital version of yourself so you can easily generate videos that look and sound like you. No need to upload your image every time.
— Google Gemini (@GeminiApp) May 19, 2026

Announced as the first step toward a system that can create anything from any input, Omni Flash focuses on video for now but emphasizes native multimodality, letting users feed in text prompts, reference images, audio clips, or existing footage and then refine outputs through natural conversation.

It promises stronger world understanding, where actions follow logical consequences, characters remain consistent across scenes and lighting conditions, and cultural or scientific context informs how narratives unfold, bridging pure photorealism with something closer to meaningful storytelling.

The model is already rolling out in the Gemini app, Google Flow, and YouTube Shorts for subscribers, with API access planned soon.

You can even reimagine the action in a video you took by asking Gemini Omni.

Transform your world instantly - change the environment, add new objects, or create something completely unexpected. pic.twitter.com/nMUyN7jNzW
— Google DeepMind (@GoogleDeepMind) May 19, 2026

Early demonstrations show Omni Flash producing clips with improved physics simulation and continuity, such as characters interacting naturally with their surroundings or styles being applied seamlessly from reference inputs.

Yet while these advances represent a clear step forward for Google in blending intelligence with media generation, they have not yet fully closed the gap with the current leaders. Comparisons shared shortly after launch highlight that Omni still trails Seedance 2.0 in overall cinematic polish and motion stability during extended sequences, and it falls short of Happyhorse 1.0's benchmark dominance in realistic character animation and synchronized audio.

Some outputs reveal lingering inconsistencies in complex physical interactions or subtle artifacts that more mature competitors have largely resolved, suggesting that Omni excels more at editable, context-aware refinement than at generating flawless video from the ground up on its first attempt.