How Researchers Merged Autoregressive and Diffusion To Create A Hybrid, Superior AI

Think of an alloy: a combination of two or more different metals that yields a new material with enhanced properties.

In a similar way, researchers have fused autoregressive and diffusion methods to create a more powerful AI. By combining the two popular approaches, they built an image generator that uses far less energy and can also run locally on a laptop or smartphone.

The goal is a far better image generator, which matters for applications like training self-driving cars to navigate unpredictable road conditions, ultimately making them safer for real-world use.

Since existing generative AI techniques have significant trade-offs that hinder their effectiveness, the researchers experimented with both autoregressive models and diffusion models to see whether the two could complement each other.

HART is an early autoregressive model that can generate high-resolution images with quality comparable to diffusion models, but much more efficiently.

Popular generative models fall into two primary categories: diffusion models and autoregressive models.

Diffusion models, like Stable Diffusion and DALL·E, excel at producing highly detailed, realistic images. However, they rely on an iterative noise-removal process across multiple steps—sometimes 30 or more—making them computationally expensive and slow.

On the other hand, autoregressive models, which power large language models like OpenAI's ChatGPT, generate images much faster by sequentially predicting image patches. Unfortunately, they cannot correct errors once generated, leading to lower-quality outputs with noticeable imperfections.
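The difference between the two families is easiest to see in the shape of their generation loops. The following is a toy sketch only, not how any real model is implemented: the function names are made up for illustration, the "autoregressive model" is a random rule, and the "denoiser" cheats by already knowing the target it moves toward.

```python
import random


def autoregressive_generate(num_tokens, vocab_size, seed=0):
    """Toy autoregressive loop: one pass, each token is emitted once and never revisited."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        # A real model would predict the next token from all previous tokens.
        # Here a random rule that depends on the previous token stands in for it.
        prev = tokens[-1] if tokens else 0
        tokens.append((prev + rng.randrange(vocab_size)) % vocab_size)
    return tokens


def diffusion_generate(target, num_steps, seed=0):
    """Toy diffusion loop: start from noise and refine the WHOLE sample at every step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(num_steps):
        # Stand-in denoiser (assumption): nudge every value toward a known target.
        # A real denoiser is a neural network that predicts the noise to remove.
        remaining = num_steps - step
        x = [xi + (ti - xi) / remaining for xi, ti in zip(x, target)]
    return x


if __name__ == "__main__":
    print(autoregressive_generate(num_tokens=6, vocab_size=16))
    print(diffusion_generate(target=[0.5, -0.25, 1.0], num_steps=30))
```

The autoregressive loop calls its model once per token and cannot fix an earlier mistake, while the diffusion loop revisits every value on each of its many steps, which is where both its quality and its cost come from.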

To get the best of both, researchers from MIT and Nvidia developed HART (Hybrid Autoregressive Transformer), an approach that combines the strengths of the two methods.

HART employs an autoregressive model to generate the foundational image structure quickly, followed by a lightweight diffusion model that refines details to enhance quality.

The main struggle was integrating the diffusion model effectively without accumulating errors. The researchers found that applying the diffusion model early in the process led to inaccuracies, whereas using it only to predict residual tokens at the final stage significantly improved the model's performance.

This hybrid approach enables HART to generate images that meet, if not exceed, the quality of state-of-the-art diffusion models while operating about nine times faster. It also significantly reduces computational overhead, making it feasible to run on standard commercial laptops and even smartphones. Users can simply enter a natural language prompt into the HART interface to generate a high-quality image almost instantly.

In all, HART pairs a 700-million-parameter autoregressive transformer for initial image generation with a lightweight 37-million-parameter diffusion model for detail enhancement.

Despite its relatively compact size, HART delivers image quality comparable to that of a 2-billion-parameter diffusion model while consuming 31% less computational power. This efficiency unlocks new possibilities for AI applications beyond static image generation.
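To put those sizes in perspective, here is a quick back-of-the-envelope calculation. Only the three parameter counts come from the figures quoted above; the variable names are illustrative, and parameter count is only a rough proxy for compute, so this does not reproduce the 31% figure.

```python
# Parameter counts quoted in the article.
AR_PARAMS = 700_000_000           # autoregressive transformer
DIFFUSION_PARAMS = 37_000_000     # lightweight diffusion model
BASELINE_PARAMS = 2_000_000_000   # the diffusion model HART is compared against

total = AR_PARAMS + DIFFUSION_PARAMS
fraction_of_baseline = total / BASELINE_PARAMS
diffusion_share = DIFFUSION_PARAMS / total

print(f"HART total parameters: {total:,}")
print(f"Size relative to the 2B baseline: {fraction_of_baseline:.3f}")
print(f"Share of HART that is the diffusion model: {diffusion_share:.3f}")
```

In other words, the two HART components together are a little over a third the size of the comparison model, and the diffusion part makes up only about a twentieth of HART itself.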

HART generates 1024-pixel images with quality comparable to some state-of-the-art diffusion models.

"If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART," said Haotian Tang, co-lead author of the research paper about HART.

To make this possible, HART uses its autoregressive model for structural generation: it quickly predicts the general composition of the image using discrete tokens, which keeps processing fast.

Then, rather than refining the entire image through an exhaustive noise-removal process, a small diffusion model predicts only residual tokens, which compensate for the information lost by the discrete tokens and enhance intricate details such as edges, facial features, and textures.
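The idea of residual tokens can be illustrated with a deliberately tiny example. This is a sketch under heavy assumptions, not HART's actual tokenizer: the "latents" are plain numbers, the codebook is hand-written, and the residual model is faked by assuming it recovers 80% of each true residual.

```python
# Hypothetical five-entry codebook standing in for the discrete token vocabulary.
CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]


def quantize(latents):
    """Map each continuous value to the index of its nearest codebook entry."""
    return [min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x))
            for x in latents]


def decode(tokens):
    """Turn discrete token indices back into codebook values."""
    return [CODEBOOK[t] for t in tokens]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


latents = [0.12, -0.71, 0.48, 0.93, -0.26]   # made-up continuous values
tokens = quantize(latents)                   # coarse, discrete description
coarse = decode(tokens)

# Residual: whatever the discrete tokens failed to capture.
true_residual = [x - q for x, q in zip(latents, coarse)]

# Assumption: the small refinement model recovers 80% of each residual.
predicted_residual = [0.8 * r for r in true_residual]
refined = [q + r for q, r in zip(coarse, predicted_residual)]

coarse_error = mean_abs_error(latents, coarse)
refined_error = mean_abs_error(latents, refined)
print(tokens, coarse_error, refined_error)
```

The discrete tokens get the values roughly right, and the residual only has to cover the small gap that is left. That is why the model responsible for the residual can be much smaller than one that must produce the whole image.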

By limiting the diffusion model's role to refining details, HART achieves high-quality image generation in just eight steps, a stark contrast to the 30+ steps required by standard diffusion models.

This results in substantial speed gains while preserving superior image clarity.

Long story short, HART can rival or outperform much larger models while being far more efficient.

Published:

24/03/2025