Background

Text-to-Speech Model 'Miso One' Boasts 8 Billion Parameters And Delivers Low-Latency Speech Faster Than Humans

Large language models continue to improve the more they're trained, and voice is no exception.

Since the launch of ChatGPT in late 2022, the AI sector has experienced what many describe as the LLM wars. Closed-source leaders such as OpenAI, Anthropic, and Google advanced large language models at a rapid pace, while open-source projects from Meta and independent developers made sophisticated text generation available to a broader audience.

By mid-2026, this rivalry has extended beyond text into areas such as voice synthesis, where open models are beginning to close the performance divide with proprietary platforms.

In this context, Miso Labs, a small San Francisco startup founded in 2025 by Aoden Teo and Cassidy Dalva through Y Combinator, has released its first major product.

The company focuses on foundation models for voice that aim to replicate the natural flow of human conversation.

Its new model, 'Miso One,' is an 8-billion-parameter open-source text-to-speech system that accepts text input along with optional short audio samples for voice cloning.

The primary distinction from established text-to-speech services lies in accessibility and control.

Offerings from OpenAI, ElevenLabs, Google Cloud, and Microsoft Azure operate exclusively through paid application programming interfaces, routing all requests to remote servers and billing based on usage volume. This arrangement delivers convenience and reliability but requires data to leave the user's system.

Miso One's full model weights, by contrast, are publicly available on repositories such as GitHub, allowing direct download and local deployment.

This means it can run entirely on the user's hardware after download.

It incurs no usage-based fees and keeps generated audio fully private, although users must manage their own setup, hardware requirements, and any optimizations.

It also differs in its approach to speed.

Miso One is engineered for a latency of approximately 110 milliseconds between text input and audio output, which suits real-time conversational tools.

Commercial systems achieve comparable or faster speeds in optimized modes, yet they frequently prioritize consistency across languages or enterprise features.
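For readers who want to check a latency figure like this on their own hardware, the usual measurement is the time between submitting text and receiving the first chunk of audio. The sketch below is illustrative only: `fake_synthesize` is a hypothetical stand-in that simulates a streaming engine with a fixed delay, not Miso One's actual interface, and `time_to_first_chunk` is a name chosen for this example.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk(
    synthesize: Callable[[str], Iterable[bytes]], text: str
) -> float:
    """Return the seconds elapsed until the first audio chunk arrives."""
    start = time.perf_counter()
    stream = iter(synthesize(text))
    first = next(stream, None)  # blocks until the engine yields audio
    if first is None:
        raise ValueError("synthesizer produced no audio")
    return time.perf_counter() - start


def fake_synthesize(text: str) -> Iterator[bytes]:
    """Hypothetical streaming engine (assumption, not a real API)."""
    time.sleep(0.11)  # simulate roughly 110 ms before the first audio
    for _ in text.split():
        yield b"\x00" * 320  # placeholder audio frame


if __name__ == "__main__":
    latency = time_to_first_chunk(fake_synthesize, "Hello, how can I help?")
    print(f"time to first audio: {latency * 1000:.0f} ms")
```

Swapping `fake_synthesize` for any function that takes text and yields audio chunks would give a comparable number for a real engine, though results depend heavily on hardware and on how the engine streams its output.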

In terms of expressiveness, the model incorporates architectural elements designed to capture emotional tone, pacing shifts, and natural inflections within a single pass. Commercial alternatives range from ElevenLabs' emphasis on natural prosody to the clearer but sometimes more uniform delivery of OpenAI, Google, or Azure voices.

Independent comparisons of Miso One remain preliminary given its recent arrival.

Voice cloning capabilities add another layer of comparison.

Miso One enables one-shot replication from brief audio excerpts, producing results that adapt quickly to new speakers. Paid services offer similar functions, with ElevenLabs requiring moderate sample lengths for strong fidelity and Azure imposing review processes for custom voices. The open-source design of Miso One permits unrestricted experimentation and domain-specific fine-tuning without external oversight, though it also places greater emphasis on user responsibility for ethical application.

When it comes to maturity and practical deployment, the commercial platforms draw on years of iterative development, broad testing, and integrated support ecosystems.

They manage pronunciation nuances, multilingual support, and seamless tool compatibility more readily in standard scenarios.

Miso One shows promise in conversational realism and customization potential but operates at an earlier stage, where ongoing community input and benchmarks will shape its trajectory.

For developers or organizations that value complete ownership, cost-free operation, and modifiable code, it represents a practical option.

Those who prefer minimal setup and proven scalability may continue to favor established API services.

The emergence of models such as Miso One signals a continuing movement in text-to-speech toward wider availability, as the field weighs technical progress against concerns of quality, ethics, and real-world implementation.

Published: 05/06/2026