
Large language models (LLMs) are very capable, but their autoregressive approach has one significant weakness.
LLMs that operate autoregressively generate output one token at a time, from left to right. At each step the model scores every candidate next token given everything that has already been produced, then picks one, typically the most likely candidate or a sample from the most likely few. Once a token is emitted it is fixed in the sequence. The process then repeats for the following token, and so on.
Because each prediction depends on the full preceding context, the model must complete one token before it can consider the next. Tokens are typically subword units rather than whole words, but the sequential dependency remains the same.
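The loop described above can be sketched in a few lines. This is a toy illustration only: `TOY_SCORES` and `next_token_scores` are hypothetical stand-ins for a real model, and the lookup conditions on the last token alone to keep the example short, whereas a real LLM conditions on the full prefix.

```python
from typing import Dict, List

# Hypothetical toy "model": scores for the next token given the last token.
# A real LLM conditions on the entire prefix, not just the final token.
TOY_SCORES: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.9, "</s>": 0.1},
    "dog": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
}


def next_token_scores(prefix: List[str]) -> Dict[str, float]:
    """Return next-token scores for the current prefix (toy lookup)."""
    return TOY_SCORES[prefix[-1]]


def greedy_decode(max_new_tokens: int = 10) -> List[str]:
    """Generate left to right, committing the top-scoring token at each step."""
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        tokens.append(best)  # committed: later steps never revisit this choice
        if best == "</s>":
            break
    return tokens


print(greedy_decode())  # ['<s>', 'the', 'cat', 'sat', '</s>']
```

Each pass through the loop needs the token chosen by the previous pass, which is why the steps cannot run in parallel.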
This left-to-right commitment creates practical constraints.
An early choice that later proves inconsistent or incorrect cannot be revised without regenerating from an earlier point. Errors can propagate forward through the rest of the output. In tasks that require global consistency, such as filling gaps in code, balancing mathematical constraints, or maintaining formatting across a block, the model has limited ability to look ahead or reconsider earlier decisions. The causal attention mask used in these models reinforces the restriction: each position can attend only to earlier positions, never to tokens that have not yet been generated.
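As a small illustration of the causal mask mentioned above, the sketch below builds the usual lower-triangular pattern in plain Python. The function names are hypothetical, and the sketch assumes the common convention in which a position may also attend to itself.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Assumes the common convention that a position can see itself
    and everything before it, but nothing after it.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def render(mask: List[List[bool]]) -> str:
    """Draw the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(render(causal_mask(4)))
# x...
# xx..
# xxx.
# xxxx
```

Row i shows what position i can see. The blocked upper triangle is the part of the sequence that does not exist yet during generation.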
To address this left-to-right restriction, Google has released 'DiffusionGemma,' an experimental open-weights LLM that applies diffusion techniques to text generation.
Read more in our blog: https://t.co/14eMidnMRf
— Google Gemma (@googlegemma) June 10, 2026
Diffusion models, first popularized for image synthesis, take a different route from traditional autoregressive generation.
They begin with a field of pure random noise and then iteratively remove that noise over a series of refinement steps. At every step the model predicts how to adjust the current noisy state toward a cleaner version, using information from the entire image rather than building it piece by piece from one corner. The same canvas is updated in parallel across many denoising passes until the result converges on coherent structure.
DiffusionGemma adapts this idea to text.
Built on the Gemma 4 26B Mixture of Experts backbone, it activates roughly 3.8 billion parameters during inference and is released under an Apache 2.0 license, with weights available on Hugging Face.
The approach departs from the dominant method used in most large language models today.
Intelligent Self-Correction: Similar to AI image generators, the model iteratively refines its own output. It evaluates the entire text block at once to seamlessly close formatting and fix mistakes in real-time.
— Google Gemma (@googlegemma) June 10, 2026
Generation begins with a canvas of 256 random or placeholder tokens.
Over multiple denoising steps the model refines the entire block simultaneously.
Bidirectional attention lets every position on the canvas attend to every other position at the same time.
Tokens that become confident early can provide context that helps resolve neighboring uncertain positions. Low-confidence tokens can be re-noised and re-resolved in later passes, giving the model an internal mechanism for self-correction. The process continues until the block settles into coherent text.
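The refinement loop described above can be sketched with a toy denoiser. Everything here is an assumption made for illustration, not DiffusionGemma's actual procedure: `toy_denoiser`, the integer confidence scores, the threshold, and the six-token canvas are all hypothetical. The toy model simply grows more confident about a position once its neighbours are resolved.

```python
from typing import List, Tuple

MASK = "<mask>"
# What the toy denoiser converges to (a tiny JSON-like block).
TARGET = ["{", '"a"', ":", "1", "}", "<eos>"]


def toy_denoiser(canvas: List[str]) -> List[Tuple[str, int]]:
    """Hypothetical stand-in for the model: one (token, confidence %) per position.

    Confidence rises with the number of resolved neighbours, mimicking how
    confident tokens help settle uncertain ones. Block edges count as
    resolved context.
    """
    n = len(canvas)
    guesses = []
    for i in range(n):
        left_ok = i == 0 or canvas[i - 1] != MASK
        right_ok = i == n - 1 or canvas[i + 1] != MASK
        confidence = 40 + 30 * (int(left_ok) + int(right_ok))
        guesses.append((TARGET[i], confidence))
    return guesses


def denoise(n: int, threshold: int = 70, max_steps: int = 10):
    """Refine the whole canvas each step; re-mask anything below the threshold."""
    canvas = [MASK] * n
    history = []
    for _ in range(max_steps):
        guesses = toy_denoiser(canvas)  # every position predicted in one pass
        canvas = [tok if conf >= threshold else MASK for tok, conf in guesses]
        history.append(list(canvas))
        if MASK not in canvas:
            break
    return canvas, history


final, history = denoise(len(TARGET))
for step, state in enumerate(history, start=1):
    print(step, " ".join(state))
```

In this toy run the canvas resolves from both edges inward and settles in three passes, rather than the six sequential steps a token-by-token loop would take. Every position is re-predicted on every pass, so a token whose confidence dropped below the threshold would be masked again and resolved later.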
Because it generates an entire block at once rather than one token at a time, DiffusionGemma enables behavior that token-by-token decoding does not.
Fast: Generates up to 1,000+ tokens a second for up to 4x faster text generation.
Lightweight: Runs smoothly right on 18GB consumer graphics cards.
Smart editing: Since it… pic.twitter.com/xTCFSQKZrT
— Google (@Google) June 10, 2026
Because the entire block is processed in parallel within each denoising step, the computation shifts from being limited by repeated memory access to being limited by raw compute throughput.
On dedicated GPUs this produces substantially higher token throughput than sequential autoregressive decoding on the same hardware. Reported figures reach more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an RTX 5090-class card when quantized. The model fits comfortably in 18 GB of VRAM after quantization, making it runnable on high-end consumer GPUs.
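A rough back-of-the-envelope check shows why quantization matters for the 18 GB figure. The bit widths below are assumptions, since the article does not say which quantization format is used, and the estimate covers weights only: it ignores the key-value cache, activations, and runtime overhead. It also assumes all 26 billion parameters must be resident in memory, even though only about 3.8 billion are active per forward pass.

```python
def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count from the article

# Bit widths are illustrative assumptions, not the released formats.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gigabytes(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions, only the 4-bit case (about 13 GB) leaves headroom inside an 18 GB budget, which is consistent with the claim that the model fits after quantization.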
For outputs longer than 256 tokens the system uses a hybrid strategy called block autoregressive diffusion.
Once one 256-token block has been fully denoised it is committed to the key-value cache.
The model then initializes a fresh canvas for the next block, conditioned on the preceding committed context. Within each block the generation remains parallel and bidirectional; across blocks it stays sequential. This preserves the ability to produce extended coherent text while retaining the speed advantage inside every block.
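The outer loop of this hybrid strategy can be sketched as follows. This is a minimal sketch under stated assumptions: `generate_blocks` and `toy_denoise_block` are hypothetical names, the `committed` list stands in for the key-value cache, and the toy denoiser just emits placeholder tokens where a real system would run the parallel denoising passes.

```python
from typing import Callable, List, Tuple

BLOCK_SIZE = 256  # canvas size given in the article


def toy_denoise_block(context: List[str], size: int) -> List[str]:
    """Hypothetical stand-in for fully denoising one block given prior context."""
    start = len(context)
    return [f"t{start + i}" for i in range(size)]


def generate_blocks(
    prompt: List[str],
    n_blocks: int,
    denoise_block: Callable[[List[str], int], List[str]],
    block_size: int = BLOCK_SIZE,
) -> Tuple[List[str], List[int]]:
    """Sequential across blocks; each block is produced in one denoising call."""
    committed = list(prompt)  # stands in for the key-value cache
    context_lengths = []
    for _ in range(n_blocks):
        context_lengths.append(len(committed))
        block = denoise_block(committed, block_size)  # parallel inside the block
        committed.extend(block)  # committed blocks are not revised afterwards
    return committed[len(prompt):], context_lengths


tokens, seen = generate_blocks(["p0"], n_blocks=3,
                               denoise_block=toy_denoise_block, block_size=4)
print(len(tokens), seen)  # 12 [1, 5, 9]
```

Each call sees a longer committed context than the one before it, so the dependency between blocks is still left to right, while all the work inside a block happens within a single call.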
Accessible Hardware: A 26B Mixture of Experts (MoE) model that activates only 3.8B parameters during inference. Fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.
— Google Gemma (@googlegemma) June 10, 2026
Bi-directional Attention: Generating 256 tokens in parallel allows every token to attend to all others. Unlocks significant advantages for non-linear domains like in-line editing, code infilling, and mathematical graphs.
— Google Gemma (@googlegemma) June 10, 2026
The bidirectional view and iterative refinement bring measurable differences on certain tasks. In structured problems where every element constrains many others, such as Sudoku grids represented as strings, the base model starts with near-zero success.
After targeted fine-tuning the same architecture reaches roughly 80% correctness while using fewer denoising steps overall. Similar advantages appear in code infilling, closing unbalanced markup or JSON, and generating outputs that must satisfy global formatting or logical constraints.
The model can adjust earlier parts of a block as later context becomes clearer, rather than being locked into an initial path.
Quality remains lower than the standard autoregressive Gemma 4 models on many general-purpose benchmarks, especially where factual precision or long-range coherence without strong structural cues is required.
The current release is positioned as experimental. Its strongest practical value appears in latency-sensitive, single-user, local workflows where the ability to produce a usable block quickly and then iterate or edit matters more than peak accuracy on every token.
Examples include inline code suggestions, real-time collaborative drafting, or rapid prototyping of constrained outputs.
We're releasing DiffusionGemma as an open model under an Apache 2.0 license for anyone to experiment with.
Download the model weights on @huggingface, and learn more about DiffusionGemma → https://t.co/nPFBhQQqqj pic.twitter.com/ZcRbe3LsT6
— Google (@Google) June 10, 2026
The architecture reuses the same underlying Gemma 4 MoE weights with an added diffusion head, which simplifies integration into existing serving stacks.
Support already exists in vLLM, Hugging Face Transformers, MLX, and several other frameworks. Fine-tuning recipes and community GGUF conversions have appeared quickly after release.
In short, most current language models generate by repeatedly asking "given what I have written so far, what comes next?"
DiffusionGemma instead asks "given this noisy block and the surrounding context, how can the whole block be improved in the next refinement step?"
The shift from sequential token-by-token prediction to parallel iterative refinement of fixed-size blocks changes both the speed profile and the kinds of reasoning patterns the model can express efficiently.