
OpenAI has announced a set of new models for its Realtime API, aimed at making voice interactions more intelligent and responsive.
The release includes 'GPT-Realtime-2,' which incorporates reasoning capabilities at the level of GPT-5, along with two supporting models for translation and transcription. These additions allow developers to create voice agents that process audio input and output in a continuous loop rather than relying on separate steps for speech recognition, language processing, and synthesis.
The models became available immediately in the API, with testing options in the OpenAI Playground and starter code examples provided through the Codex tool.
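The continuous loop described above is driven over a persistent session: the client streams audio in and receives audio out on the same connection, with no separate transcription or synthesis stage. A minimal sketch of the client-side message payloads, assuming the model identifier `gpt-realtime-2` and event names patterned on OpenAI's existing Realtime API (`session.update`, `input_audio_buffer.append`), which may differ from the final documentation:

```python
import base64
import json

# Hypothetical model identifier taken from the announcement; the event
# shapes below follow the pattern of OpenAI's existing Realtime API and
# are an assumption, not confirmed for these new models.
MODEL = "gpt-realtime-2"

def session_update(voice: str = "alloy") -> str:
    """Build the message that configures the session once the socket opens."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "model": MODEL,
            "modalities": ["audio", "text"],
            "voice": voice,
        },
    })

def audio_chunk(pcm_bytes: bytes) -> str:
    """Build the message that appends one chunk of microphone audio.

    Audio is base64-encoded so it can travel inside a JSON event.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

# Over a live WebSocket connection these strings would be sent directly,
# e.g. ws.send(session_update()) followed by ws.send(audio_chunk(chunk))
# for each captured buffer, while response audio arrives as server events.
```

Because input and output share one session, the model can begin responding while audio is still arriving, which is what removes the latency of a recognize-then-process-then-synthesize pipeline.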
GPT-Realtime-2 stands out as the primary model for building production voice agents.
It handles interruptions, supports parallel tool calls, and offers a context window of up to 128,000 tokens, which helps maintain coherence during longer sessions.
Developers can adjust reasoning effort across five levels to balance speed and depth, and the model includes features like preambles to keep users informed while it processes requests or accesses external systems. In practice, this means an agent can check a calendar, update a customer relationship management record, or retrieve information without breaking the flow of conversation.
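The calendar-checking flow above combines two of these features: a tool the model can call and an effort setting that trades speed for depth. A sketch of what that session configuration might look like, with the caveat that the five level names, the `reasoning_effort` field, and the `get_calendar_events` tool are all illustrative assumptions (the tool schema is modeled on OpenAI's function-calling format):

```python
# The announcement describes five adjustable reasoning-effort levels;
# these level names and the "reasoning_effort" field are guesses.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "maximal")

# Illustrative tool definition, not taken from the announcement, using
# the JSON Schema style of OpenAI's function-calling interface.
CALENDAR_TOOL = {
    "type": "function",
    "name": "get_calendar_events",
    "description": "Fetch the user's calendar events for a date range.",
    "parameters": {
        "type": "object",
        "properties": {
            "start": {"type": "string", "description": "ISO 8601 start date"},
            "end": {"type": "string", "description": "ISO 8601 end date"},
        },
        "required": ["start", "end"],
    },
}

def agent_session(effort: str = "medium") -> dict:
    """Build a session config balancing speed (low effort) against depth."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-2",
            "reasoning_effort": effort,
            "tools": [CALENDAR_TOOL],
        },
    }
```

With tools registered this way, the model can invoke them in parallel mid-conversation and use preambles to tell the user what it is doing while the calls complete.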
Our new voice models are now available in the Realtime API:
GPT-Realtime-2: Build production-ready voice agents that can think harder, take action, handle interruptions, and keep conversations flowing.
GPT-Realtime-Translate: Translate while streaming across more than 70… — OpenAI (@OpenAI) May 7, 2026
Evaluations showed improvements over earlier versions on benchmarks measuring audio intelligence and multi-turn instruction following.
The accompanying models address specific audio tasks.
GPT-Realtime-Translate performs live speech-to-speech translation from more than 70 input languages into 13 output languages while preserving natural pacing and handling domain-specific terms or regional accents. GPT-Realtime-Whisper delivers low-latency streaming transcription that generates captions or notes as speech occurs.
Both operate within the same realtime framework, enabling applications such as multilingual customer support or live event captioning without additional pipelines.
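Because both models live inside the same session framework, switching an application from conversation to translation or captioning is a matter of session configuration rather than a new pipeline. A sketch under stated assumptions: the lowercase model identifiers mirror the announced names, but the `output_language` field is a hypothetical parameter, not a documented one:

```python
def translate_session(output_language: str) -> dict:
    """Configure live speech-to-speech translation.

    The announcement cites 70+ input languages and 13 output languages;
    "output_language" here is an illustrative parameter name.
    """
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-translate",
            "modalities": ["audio"],
            "output_language": output_language,
        },
    }

def caption_session() -> dict:
    """Configure low-latency streaming transcription: audio in, text out."""
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-whisper",
            "modalities": ["text"],
        },
    }
```

A multilingual support desk could, for example, hold one captioning session for the agent's notes and one translation session per caller, all against the same realtime endpoint.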
Pricing follows a per-token or per-minute structure depending on the model, with details listed in the API documentation.
A demonstration video released alongside the announcement illustrated these capabilities in action.
One segment showed live translation between French and English, with the system switching seamlessly to German when interrupted and correctly rendering technical terminology. Another depicted a voice assistant managing a personal schedule, pulling calendar details, and updating records while providing verbal updates during processing.
The interactions highlighted reduced latency and more natural turn-taking compared to previous realtime systems.
This update builds on OpenAI's earlier audio releases from 2025, which introduced initial speech-to-speech models and realtime API features.
The new models shift the architecture further toward continuous multimodal reasoning, moving away from rigid pipelines of automatic speech recognition followed by large language model processing and text-to-speech output. Discussions on social platforms noted potential applications in customer service, education, travel planning, and global communication, with some observers pointing to impacts on industries that rely on real-time human intermediaries.
Developer feedback included interest in integration with existing tools and questions about latency in complex multi-step scenarios.
The announcement has prompted conversations about the evolving role of voice interfaces in software.
Companies such as Zillow, Deutsche Telekom, and Priceline have explored similar voice-driven workflows for tasks ranging from property searches to trip management.
At the same time, some users expressed ongoing preferences for legacy models like GPT-4o in certain contexts, reflecting varied experiences with prior voice updates. Overall, the release positions the Realtime API as a foundation for more capable conversational systems that operate directly in audio space. Developers can begin experimenting with the models through the platform today.