Background

Google Gives Its Gemini 2.0 Flash Experimental Its Own Native Image Generator

Google Gemini

Large Language Models are definitely becoming smarter.

When OpenAI introduced ChatGPT, the LLM and some subsequent LLMs from rivals were only text-based and that's just about it. Fast forward, AI companies began putting different AIs into an interface, so users can switch them whenever they want.

But then, companies are desperately trying to unlock full multimodal capabilities, and knowing that it's hard, progress is kind of slow.

Google is sitting on huge resources, and it's more than capable of making Gemini better than ever at a pace probably unmatched by most.

And this time, Google has finally released its new Gemini 2.0 Flash Experimental model with the ability to generate and edit images natively.

Results speak for themselves.

In a blog post, Google said that:

"In December we first introduced native image output in Gemini 2.0 Flash to trusted testers. Today, we're making it available for developer experimentation across all regions currently supported by Google AI Studio. You can test this new capability using an experimental version of Gemini 2.0 Flash (gemini-2.0-flash-exp) in Google AI Studio and via the Gemini API."

"Gemini 2.0 Flash combines multimodal input, enhanced reasoning, and natural language understanding to create images."

The idea of making Gemini a proper multimodal AI is to make it more like an AI that suits different purposes.

AI image generation has been available with all major AI chatbots like ChatGPT for quite some time, for sure, and that users have been generating AI images on Gemini and other LLMs as well.

However, to do this, they have to prompt the AI with their query, and that request ill fire up specialized but separate AI-powered image generators, like Imagen 3.

The said model is trained on images and designed only to generate images, but because Imagen 3 is a separate AI, it acts like an extension to the main AI, not part of it.

This time, Google is making Gemini, which is already a language-vision model, to natively multimodal, meaning that it can inherently understand, generate, and modify both text and images.

Google is the first to create an AI with this capability.

With native image generation, users can get better consistency as multimodal models are trained on a large dataset of different modalities. As a result, such models boast better understanding of concepts and exhibit broader world knowledge.

Beyond image generation, users can also easily edit existing images with simple prompts.

For example, users can upload an image and ask the model to add sunglasses, insert legible text, remove objects, and more to the image. And unlike diffusion models which have to regenerate the whole image with each new prompt, natively multimodal models are able to maintain consistency across multiple modifications.

"Unlike many other image generation models, Gemini 2.0 Flash leverages world knowledge and enhanced reasoning to create the right image," said Google.

What's more, Google showcases how the AI is able to render long text without issues.

"Most image generation models struggle to accurately render long sequences of text, often resulting in poorly formatted or illegible characters, or misspellings. Internal benchmarks show that 2.0 Flash has stronger rendering compared to leading competitive models, and great for creating advertisements, social posts, or even invitations," the company said.

Initially, Google is introducing this Gemini 2.0 Flash Experimental model with native image generation on Google’s AI Studio for free.

After launching it in preview, the company plans to releasing it for everyone on the Gemini main platform.

Thanks to its multimodal capability, there are tons of things this Gemini can do that rivals struggle to begin with.

Use cases include, and not limited to, enhancing creative tools like localized artwork creation and detailed image editing, catering to industries such as design, marketing, and content creation, and so forth.

"Whether you are building AI agents, developing apps with beautiful visuals like illustrated interactive stories, or brainstorming visual ideas in conversation, Gemini 2.0 Flash allows you to add text and image generation with just a single model. We're eager to see what developers create with native image output and your feedback will help us finalize a production-ready version soon," said Google

Published: 
17/03/2025