Background

Inception Wants To Speed Up LLM Text And Code Generation Using Diffusion-Powered Models

Language and code both follow a natural order. They have a beginning and an end.

Most modern large language models (LLMs) from Google, OpenAI, Anthropic, Meta, and others generate text and code through a method known as 'autoregression.' In simple terms, the model produces content one token at a time, each new token predicted based on everything that came before it.

This approach delivers impressive accuracy and coherence.

But there’s a catch: speed.

Because autoregression is inherently sequential, the model must finish predicting one token before it can move on to the next. This token-by-token process slows down generation, making long responses noticeably sluggish. The effect compounds at scale, driving up computational costs and inefficiencies, and it becomes especially apparent in real-time applications and enterprise deployments.
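To make the bottleneck concrete, here is a minimal sketch of an autoregressive decoding loop. Everything in it (the toy model, the greedy sampler, the token IDs) is a hypothetical stand-in rather than any vendor's real API; the point is simply that each new token requires a full forward pass that cannot begin until the previous token is known.

```python
import random

# Toy stand-in for a language model: returns fake "logits" over a tiny
# vocabulary. Purely illustrative -- not Inception's or any vendor's API.
class ToyModel:
    vocab_size = 100

    def forward(self, tokens):
        # A real model would attend over every previous token here,
        # which is why each new token needs its own full forward pass.
        return [random.random() for _ in range(self.vocab_size)]

def sample(logits):
    # Greedy decoding for simplicity: take the highest-scoring token.
    return max(range(len(logits)), key=lambda i: logits[i])

def generate_autoregressive(model, prompt_tokens, max_new_tokens, eos_id=0):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)  # one forward pass per new token
        next_token = sample(logits)
        tokens.append(next_token)       # only now can the next step begin
        if next_token == eos_id:        # stop at end-of-sequence
            break
    return tokens

print(generate_autoregressive(ToyModel(), [1, 2, 3], max_new_tokens=5))
```

A 500-token reply means 500 trips through that loop, one after another; no amount of parallel hardware removes the dependency between iterations.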

Inception, however, introduces a completely new way to think about this problem.

Instead of relying on autoregression, Inception wants to use a method borrowed from the world of image and video generation, where models like DALL·E, Midjourney, and Sora produce complex visuals all at once rather than pixel by pixel.

The method, called 'diffusion', allows Inception to create LLMs that can drastically reduce latency while maintaining coherence and quality.

Instead of crafting responses sequentially, diffusion models start from structured randomness and refine it in parallel across multiple steps until the final, coherent output emerges. Put simply, while autoregressive models "speak" word by word, diffusion models "form" the entire thought at once. This allows Inception’s systems to generate responses up to ten times faster without losing accuracy or contextual understanding.
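The shape of that refinement loop can be sketched in a few lines. This is a deliberately crude caricature, not Inception's published algorithm: the toy denoiser below just re-proposes tokens at random, where a real diffusion LLM would condition on the prompt and the current draft, keeping confident tokens and refining the rest.

```python
import random

def toy_denoise(draft):
    # Toy stand-in for a diffusion denoiser: re-proposes every output
    # position in parallel. A real model would keep high-confidence
    # tokens and refine only the uncertain ones at each step.
    return [random.randrange(100) for _ in draft]

def generate_diffusion(prompt_tokens, output_length, num_steps=10):
    # Start from "structured randomness": an all-placeholder draft.
    draft = [0] * output_length
    for _ in range(num_steps):
        # Every position is updated at once; the number of passes is
        # fixed by num_steps, not by how long the output is.
        draft = toy_denoise(draft)
    return list(prompt_tokens) + draft

print(generate_diffusion([1, 2, 3], output_length=8))
```

The key contrast with the autoregressive loop is in what the loop counts: steps of whole-sequence refinement rather than individual tokens.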

The company’s flagship product, Mercury, is the world’s first commercially available diffusion LLM.

And its design makes it remarkably efficient.

Inception claims it’s between five and ten times faster than speed-optimized models from OpenAI, Anthropic, and Google, while delivering comparable precision. Mercury is available in two versions: a general-purpose conversational model and Mercury Coder, a variant tuned specifically for code generation. Both support a massive 128,000-token context window.

That’s the equivalent of about 300 pages of text.

This allows for deeper, more complex conversations and programming tasks.
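The 300-page figure follows from common rules of thumb; the conversion ratios below are rough assumptions, not Inception's numbers.

```python
# Back-of-envelope check on the 128,000-token context window.
# Both ratios are rules of thumb, not official figures.
context_tokens = 128_000
words_per_token = 0.75   # common approximation for English text
words_per_page = 300     # typical printed page

words = context_tokens * words_per_token  # ~96,000 words
pages = words / words_per_page            # ~320 pages

print(f"~{words:,.0f} words, ~{pages:.0f} pages")
```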

By eliminating the need to predict tokens one by one, Mercury reduces GPU demands: organizations can either run larger models at the same cost or handle more users without scaling up their infrastructure. That makes it ideal for applications where speed defines the experience, such as live voice assistants, dynamic user interfaces, and real-time coding tools.

For developers and enterprises, that translates to lower operational costs and a smoother, more natural interaction with AI systems.
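A back-of-envelope comparison shows where those savings come from. The step count here is an illustrative assumption (the article does not disclose Mercury's internals), and the passes are not equally cheap, since each diffusion pass covers the whole sequence; but fewer, wider passes are exactly the workload GPUs parallelize well.

```python
# Hypothetical count of forward passes per generated response.
# All numbers are illustrative assumptions, not measured Mercury figures.
response_tokens = 500    # length of one generated reply
diffusion_steps = 20     # assumed number of parallel refinement steps

autoregressive_passes = response_tokens  # one pass per new token
diffusion_passes = diffusion_steps       # fixed, regardless of length

print(f"autoregressive: {autoregressive_passes} passes")
print(f"diffusion: {diffusion_passes} passes "
      f"({autoregressive_passes // diffusion_passes}x fewer)")
```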

Inception Labs AI, a Palo Alto–based startup, was founded in 2024 by Stefano Ermon, Aditya Grover, and Volodymyr Kuleshov, researchers from Stanford, UCLA, and Cornell.

The company has raised $50 million in funding led by Menlo Ventures, with participation from Mayfield, Innovation Endeavors, NVentures (NVIDIA’s venture arm), Microsoft’s M12 fund, Snowflake Ventures, and Databricks Investment. With this fresh capital, Inception plans to expand its research and engineering teams, accelerate product development, and enhance real-time AI performance across language, voice, and programming applications.

By using diffusion instead of autoregression, Inception is taking a bold step toward redefining how language models generate text and code.

Stefano Ermon, Inception’s CEO and one of the original architects of diffusion techniques used in major generative systems, believes this marks the beginning of a new era for language AI.

He argues that while training models has become faster, inference (the process of generating responses) remains the true bottleneck.

Diffusion, he says, is the key to making high-performance AI practical and scalable.

"Training and deploying large-scale AI models is becoming faster than ever, but as adoption scales, inefficient inference is becoming the primary barrier and cost driver," Ermon explained.

"We believe diffusion is the path forward for making frontier model performance practical at scale."

Diffusion-based language models like Inception’s Mercury generate text in parallel rather than word by word, making them much faster and more efficient than traditional autoregressive models.

However, this new approach has trade-offs.

Diffusion models can struggle with fine-grained coherence and logical flow because they don’t build sentences sequentially. They’re also harder to control or adjust mid-generation, since the entire output is refined as one block rather than built token by token.

Training diffusion-based LLMs is also more complex and computationally demanding, and the field lacks the well-understood scaling laws that make autoregressive systems predictable to build and optimize. Diffusion models are also less interpretable, which complicates debugging, safety checks, and reinforcement learning.

Despite these weaknesses, their parallel nature offers huge speed advantages, making them promising for real-time applications if researchers can close the coherence and reliability gap.

This is why Inception is designing models with self-correcting mechanisms to reduce hallucinations, along with unified multimodal capabilities that allow them to handle text, images, and code interchangeably.

Published: 08/11/2025