
Google has rolled out 'Gemini 3.1 Flash Live,' its most advanced real-time audio and voice model to date, alongside the global expansion of Search Live.
The new model aims to deliver faster responses, more natural speech patterns, and smoother handling of conversational nuances compared to earlier versions. It processes acoustic details such as pitch, pace, tone, and rhythm, allowing it to detect signs of frustration or confusion in a user's voice and adjust accordingly.
The system also manages interruptions more effectively, filters out background noise like traffic or television sounds, and maintains context across longer exchanges, reportedly following a train of thought for roughly twice as long as previous iterations.
These improvements target common pain points in voice AI, where delays of a few hundred milliseconds or unnatural inflections can make interactions feel stiff or robotic.
With Gemini 3.1 Flash Live, latency has been noticeably reduced, resulting in responses that flow more like everyday conversation.
The model supports multimodal inputs, combining audio with visual data from a phone's camera, and operates across more than 90 languages without requiring users to switch settings manually. Google describes it as a step change in reliability for real-time dialogue, making voice-first experiences feel more intuitive for everyday users as well as for more complex applications.
Give it a try in the Gemini mobile app and let us know what you think!
— Google Gemini (@GeminiApp) March 26, 2026
Search Live, the conversational search feature that lets users speak queries instead of typing them, is now available in more than 200 countries and territories, wherever Google's AI Mode is supported.
Previously limited to markets such as the U.S. and India since its initial launch in 2025, the feature is now available globally, with the rollout powered directly by the new Gemini 3.1 Flash Live model.
According to a Google blog post, users open the Google app on Android or iOS, tap the Live icon near the search bar or AI Mode button, and begin speaking. The system handles follow-up questions in a back-and-forth exchange, pulling in relevant web information and delivering spoken responses.
A key aspect of Search Live is its integration with Google Lens.
Users can activate the camera to provide visual context, allowing the AI to analyze objects or scenes in real time. For instance, someone struggling with furniture assembly could point their phone at the pieces and receive step-by-step spoken guidance, along with links to additional resources or troubleshooting tips. The feature supports ongoing dialogue about what the camera sees, blending voice input with visual understanding to create a more interactive search experience.
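Developers who want to prototype a similar camera-plus-question flow outside the Google app can approximate it with the public Gemini API, which accepts an image and a text prompt in a single request. The sketch below uses the google-genai Python SDK; the model name and file name are illustrative assumptions, and this mimics Search Live's behavior rather than calling the consumer feature itself.

```python
from google import genai
from PIL import Image

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

# Model name and image file are illustrative; any multimodal Gemini
# model accepts a still frame plus a question in one request.
frame = Image.open("bookshelf_parts.jpg")  # e.g., a photo of the parts
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[frame, "Which panel attaches to the base first, and with which screws?"],
)
print(response.text)
```

Search Live streams live video and audio rather than single frames, but the same image-plus-text pattern is the simplest way to test grounded visual Q&A against the API.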
In the Google Lens interface, a Live option appears at the bottom of the screen for quick access.
The same underlying model enhances Gemini Live, enabling longer brainstorming sessions or extended conversations where the AI maintains coherence over multiple turns.
Gemini 3.1 Flash Live is our highest quality audio & voice model yet - and a big leap towards building next-gen voice-first agents. Lower latency, better precision, more natural interactions... try it now with Gemini Live in the @GeminiApp or build with it in @GoogleAIStudio! https://t.co/JIqaaVlTuM
— Demis Hassabis (@demishassabis) March 26, 2026
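The pointer to Google AI Studio refers to the developer-facing Live API, which streams audio in and out over a persistent session. Below is a minimal sketch using the google-genai Python SDK; the model identifier is an assumption (check AI Studio for the string actually exposed for Gemini 3.1 Flash Live). The session sends one text turn and collects the streamed audio reply.

```python
import asyncio
from google import genai
from google.genai import types

# Assumed model ID for illustration; confirm the real identifier in AI Studio.
MODEL_ID = "gemini-3.1-flash-live"

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

async def main():
    config = types.LiveConnectConfig(response_modalities=["AUDIO"])
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
        # Send one text turn; the model streams raw PCM audio back.
        await session.send_client_content(
            turns=types.Content(
                role="user",
                parts=[types.Part(text="In one sentence, what can you do?")],
            )
        )
        audio = bytearray()
        async for message in session.receive():  # yields until the turn completes
            if message.data:  # inline audio bytes from the server
                audio.extend(message.data)
        print(f"Received {len(audio)} bytes of audio")

asyncio.run(main())
```

A production voice agent would instead stream microphone audio into the session in real time; this one-turn version just verifies connectivity and audio output.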
On performance, Gemini 3.1 Flash Live has posted strong results on several audio-focused benchmarks.
It achieved 90.8% on ComplexFuncBench Audio, which evaluates handling of multi-step tasks, and scored 36.1% on Scale AI’s Audio MultiChallenge, a test that includes interruptions, hesitations, and distracting background noise.
It’s better at completing tasks and understanding details in noisy environments.
It can follow long conversations so you don't have to repeat yourself. pic.twitter.com/WsxCQNDqwB
— Google DeepMind (@GoogleDeepMind) March 26, 2026
While this represents progress over prior real-time audio models, it still trails some non-conversational systems on the MultiChallenge benchmark, highlighting that fully replicating the fluidity of unscripted human exchanges remains an ongoing challenge (Ars Technica).
To help address potential misuse, every audio output from the model includes an embedded SynthID watermark, an imperceptible digital marker designed to allow detection of AI-generated speech. This measure aims to reduce risks such as the spread of misinformation or deceptive content. At the same time, the increased naturalness of the voice responses has prompted discussion about scenarios where it might become harder to immediately tell whether one is interacting with an AI or a human, particularly in phone-like conversations (Ars Technica).
Overall, the launch reflects Google's continued push to integrate voice, vision, and search into more seamless, real-time tools. The changes emphasize lower latency, broader language support, and multimodal capabilities without requiring major adjustments from users in supported regions. As the features propagate through the Google app, more people will be able to test conversational search and voice interactions directly, using their voice and camera to explore information in a fluid, context-aware manner.
While the technology marks a clear advancement in making AI dialogue feel less mechanical, it also underscores the evolving complexities around authenticity and verification in voice-based interactions.