The hype surrounding Large Language Models has sent tech companies into a frenzy.
In an arms race that began when the world was awed by OpenAI's ChatGPT, others quickly jumped on the bandwagon, either adopting OpenAI's technology or building their own products. Google, the tech giant of the web and beyond, opted for the latter and created Gemini.
As the tech world's obsession with LLMs keeps growing, LLM developers have begun to recognize the limitations of traditional models.
They found that these LLMs are great talkers, but poor thinkers.
They are good at generating human-like text, mimicking styles, and recalling factual knowledge. But they are notably bad at solving multi-step problems, following chains of logic, and making consistent, context-aware decisions.
LLMs may sound smart, but they often guess rather than reason, and that can lead to wrong or even nonsensical answers.
This is why developers turned to building reasoning models.
Gemini 2.5 Flash just dropped.
As a hybrid reasoning model, you can control how much it ‘thinks’ depending on your - making it ideal for tasks like building chat apps, extracting data and more.
Try an early version in @Google AI Studio → https://t.co/iZJNqQmooH — Google DeepMind (@GoogleDeepMind) April 17, 2025
Google's journey began with PaLM (Pathways Language Model) back in April 2022; building on that foundation, Google released Gemini 2.0 Flash Thinking in December 2024.
This model was designed to handle complex multimodal tasks—like programming, math, and physics—by reasoning through problems and explaining its thought process.
In March 2025, Google unveiled Gemini 2.5, a new family of AI reasoning models that pause to "think" before answering a question.
This time, Google is finally bringing Gemini 2.5 Flash to the Gemini app.
Whereas Gemini 2.5 Pro is the flagship model, the brains of the operation, 2.5 Flash is the smaller sibling built for speed.
While 2.5 Flash is designed for efficiency, meaning lower latency and reduced cost, it's still a capable model.
But where it really shines is the "thinking budget" feature, which lets developers control how much the model thinks, balancing cost against quality.
This makes the model ideal for use cases like chatbots, customer service, and summarization, or any task where speed matters more than depth.
2.5 Flash can adjust how much it reasons based on the complexity of the prompt - enabling faster answers for more simple requests.
Devs can also control the thinking budget to find the right tradeoff between quality, cost, and latency. Here’s how it works →…— Google DeepMind (@GoogleDeepMind) April 17, 2025
In a blog post, Google said:
For users who wish to keep using Gemini 2.5 Flash at its lowest cost and latency, they can set the thinking budget to 0.
They can also choose to set a specific token budget for the thinking phase using a parameter in the API or the slider in Google AI Studio and in Vertex AI. The budget can range from 0 to 24576 tokens for 2.5 Flash.
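As a rough sketch of what that parameter looks like on the wire, the snippet below builds a REST-style `generateContent` request body with a thinking budget, clamped to the documented 0–24576 range. The field names (`generationConfig.thinkingConfig.thinkingBudget`) follow Google's public Gemini API docs, but verify them against the current API reference before relying on this.

```python
# Sketch: a generateContent request body with a thinking budget set.
# Field names are taken from Google's public Gemini API docs; check the
# current reference before use.

MAX_THINKING_BUDGET = 24576  # documented upper bound for Gemini 2.5 Flash


def build_request(prompt: str, thinking_budget: int) -> dict:
    """Build a request body, clamping the budget to the 0-24576 range."""
    budget = max(0, min(thinking_budget, MAX_THINKING_BUDGET))
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingBudget": budget},
        },
    }


# A budget of 0 disables thinking entirely, for the lowest cost and latency.
no_thinking_request = build_request("Summarize this email.", 0)
```

Setting the budget to 0 reproduces the "lowest cost and latency" mode Google describes; any positive value caps how many tokens the model may spend reasoning before it answers.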
Here, developers pay $0.15 per million tokens for input.
Output costs vary dramatically based on reasoning settings: $0.60 per million tokens with thinking turned off, jumping to $3.50 per million tokens when reasoning is fully engaged.
This nearly sixfold price difference for reasoned outputs reflects the computational intensity of the "thinking" process, where the model evaluates multiple potential paths and considerations before generating a response.
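To make that price gap concrete, here is a small cost calculator using the rates quoted above ($0.15 per million input tokens; $0.60 or $3.50 per million output tokens with thinking off or on). The function is illustrative only; actual billing also counts the thinking tokens themselves, per Google's pricing notes.

```python
# Sketch: comparing Gemini 2.5 Flash costs at the article's quoted rates.
# Rates are USD per million tokens. Illustrative only; real billing also
# includes thinking tokens.

INPUT_RATE = 0.15             # $ per 1M input tokens
OUTPUT_RATE_NO_THINKING = 0.60  # $ per 1M output tokens, thinking off
OUTPUT_RATE_THINKING = 3.50     # $ per 1M output tokens, thinking on


def cost_usd(input_tokens: int, output_tokens: int, thinking: bool) -> float:
    """Estimate the cost of one request at the quoted per-million-token rates."""
    out_rate = OUTPUT_RATE_THINKING if thinking else OUTPUT_RATE_NO_THINKING
    return (input_tokens * INPUT_RATE + output_tokens * out_rate) / 1_000_000


# For 1M tokens in and 1M tokens out:
cheap = cost_usd(1_000_000, 1_000_000, thinking=False)    # 0.15 + 0.60 = $0.75
reasoned = cost_usd(1_000_000, 1_000_000, thinking=True)  # 0.15 + 3.50 = $3.65
```

At a million tokens each way, the same request costs roughly 4.9x more with reasoning fully on, which is exactly the tradeoff the thinking budget lets developers tune.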
"Customers pay for any thinking and output tokens the model generates," said Tulsee Doshi, Product Director for Gemini Models at Google DeepMind.
"In the AI Studio UX, you can see these thoughts before a response. In the API, we currently don’t provide access to the thoughts, but a developer can see how many tokens were generated."
Before this, Anthropic had claimed to have created the first-ever hybrid reasoning model, Claude 3.7 Sonnet.
Here’s how 2.5 Flash continues to lead as the model with the best price-to-performance ratio. ↓ — Google DeepMind (@GoogleDeepMind) April 17, 2025