First there was text-to-text, then text-to-image. Then came text-to-video… and now, quite literally, text-to-world.
It all began with OpenAI's launch of ChatGPT, the model that sparked the LLM revolution by proving that large language models (LLMs) could master human-like dialogue. But in a remarkably short span, these models evolved far beyond mere text.
Text‑to‑image models followed, immersing people in stunning visual creations. Then came text‑to‑video.
OpenAI teased and then launched Sora, and Google quickly followed with Veo 3, which can not only generate videos from plain prompts but also add synchronized sound to match the visuals.
And now DeepMind, the AI lab under Alphabet's umbrella, the same parent company that owns Google, pushes this even further by introducing 'Genie 3.'
The AI model goes way beyond mere text-to-video, effectively transforming the paradigm from video generation into real‑time, interactive world creation.
Unlike Veo 3 or other text-to-video generators that produce passive videos, Genie 3 generates living, navigable worlds.
Given just a textual prompt, it builds fully interactive 3D environments, photo‑realistic or imaginative, at 720p, 24 fps, where human or AI agents can explore, move, and even issue environmental commands.
Unlike static video, these worlds respond dynamically for a few minutes, maintaining spatial and causal consistency because the model remembers what it generated.
Its architecture is auto‑regressive and memory‑enabled: frames are generated one after another, with the model referencing earlier frames (up to about one minute back) to keep the physics coherent. As a result, objects, lighting, and positions remain stable even when the viewer looks away and then returns.
What makes it unique is that this memory isn't hard-coded.
Instead, it emerged naturally as the model was scaled up, delivering consistency over time.
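To make the idea more concrete, here is a minimal, purely illustrative sketch of such an auto-regressive, memory-windowed loop in Python. Every name in it (the ToyWorldModel class, the run_session helper, the 60-second window) is a hypothetical stand-in based on the description above, not DeepMind's actual code.

```python
from collections import deque

# Hypothetical illustration only: frames are produced one after another,
# and each new frame is conditioned on a sliding window of earlier frames.

FPS = 24
MEMORY_SECONDS = 60                        # "up to about one minute back"
MEMORY_FRAMES = FPS * MEMORY_SECONDS


class ToyWorldModel:
    """Stand-in for a learned model: the next frame depends on the prompt,
    the viewer's action, and whatever earlier frames are still in memory."""

    def generate_frame(self, prompt, action, memory):
        # A real system would run a neural network here; we return a dict
        # so the loop below is runnable end to end.
        return {"prompt": prompt, "action": action, "context_frames": len(memory)}


def run_session(prompt, actions):
    model = ToyWorldModel()
    memory = deque(maxlen=MEMORY_FRAMES)   # old frames fall out of the window
    frames = []
    for action in actions:
        frame = model.generate_frame(prompt, action, memory)
        memory.append(frame)                # the new frame joins the context
        frames.append(frame)
    return frames


clip = run_session("a foggy alpine village at dusk", ["forward", "look_left", "forward"])
print(len(clip), "frames;", clip[-1]["context_frames"], "frames of context for the last one")
```

The point of the sketch is the sliding window: anything generated within roughly the last minute can still influence the next frame, which is why a wall keeps its paint when the viewer glances away and back.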
Genie 3 builds on both Genie 2 and Veo 3.
Its predecessor, Genie 2, could generate up to around a minute of coherent 3D simulation from a prompt, but without real‑time interactivity. Genie 3 extends this with true real‑time control, prompt‑driven events (like changing weather or inserting objects), and multi‑minute navigation, all while retaining world consistency.
DeepMind positions world models like Genie as essential for embodied agents, where robots and AI can learn through virtual trial and error.
With Genie 3, agents can train across unlimited synthetic environments quickly, anticipating outcomes of actions, planning ahead, and learning through interaction, much like humans do.
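For readers who want to see that training pattern spelled out, here is a toy sketch of trial-and-error learning across many cheap synthetic worlds. The make_world and Agent pieces are invented for illustration; they mimic the loop described above, not any real Genie 3 interface.

```python
import random

# Hypothetical illustration: an agent practices in many prompt-generated
# worlds, nudging its behaviour toward actions that earned reward.


def make_world(prompt, seed):
    """Stand-in for a world model: a tiny 1-D environment whose goal
    position depends on the prompt and the seed."""
    rng = random.Random(hash((prompt, seed)))
    goal = rng.randint(0, 9)

    def step(position, action):
        new_pos = max(0, min(9, position + action))
        reward = 1.0 if new_pos == goal else 0.0
        return new_pos, reward

    return step


class Agent:
    """A deliberately simple learner: it keeps a score for each action and
    gradually prefers the ones that have paid off."""

    def __init__(self):
        self.preference = {-1: 0.0, +1: 0.0}

    def act(self):
        # Exploration via random noise added to the learned preference.
        return max(self.preference, key=lambda a: self.preference[a] + random.random())

    def learn(self, action, reward):
        self.preference[action] += reward


agent = Agent()
for seed in range(100):                     # a fresh synthetic world per episode
    step = make_world("warehouse with movable crates", seed)
    position = 0
    for _ in range(20):                     # a short trial-and-error episode
        action = agent.act()
        position, reward = step(position, action)
        agent.learn(action, reward)

print("learned preferences:", agent.preference)
```

The appeal, as noted above, is that such environments are cheap and effectively unlimited, so an embodied agent can run thousands of these episodes before it ever touches the physical world.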
It’s not just a dazzling feat; it’s a research milestone toward artificial general intelligence (AGI).
"We think world models are key on the path to AGI, specifically for embodied agents, where simulating real world scenarios is particularly challenging," said Jack Parker-Holder, a research scientist on DeepMind’s open-endedness team, during the briefing.
Read: Paving The Roads To Artificial Intelligence: It's Either Us, Or Them
There are still limitations.
For example, simulated snow may not flow realistically, text rendering often fails unless explicitly provided, and interactions remain limited in scope.
Then, Genie 3 supports only a few minutes per session, falling short of the hours-long continuity needed for complete agent training. And it can only keep spaces in visual memory for about a minute, meaning that if users turn away from something in a world the AI created and then turn back to it more than a minute later, details like paint on a wall or writing on a chalkboard may no longer be where they left them.
Complex multi-agent interactions also remain elusive for now.
Access is also tightly controlled. Genie 3 is initially available only to a select group of researchers and creators in a limited research preview, while DeepMind studies potential risks and usage scenarios. No public or commercial release has been announced yet.
Still, Genie 3 feels like a watershed moment, showing how AIs can move from passive video generation to living, interactive worlds that can test, teach, and evolve.
It is certainly a huge leap along the path from text to images and from images to video, as AI can finally create dynamic environments that remember and respond.
This capability brings humanity tantalizingly closer to AI systems that can plan, reason, explore—and perhaps, one day, act with general intelligence.
"We haven’t really had a Move 37 moment for embodied agents yet, where they can actually take novel actions in the real world," Parker-Holder said, referring to the historic moment in the 2016 game of Go between DeepMind’s AI agent AlphaGo and Lee Se-dol, in which Alpha Go defeated the world champion, fair and square.
“But now, we can potentially usher in a new era,” he said.
Further reading: AI Terms Like 'AGI' And 'Super Intelligence' Are 'Designed To Activate People's Dopamine'