
The large language model (LLM) race shows no signs of slowing, as companies continue to release new models and capabilities.
Since the release of ChatGPT in 2022, the AI industry has been shaped by rapid competition around large language models. Companies such as OpenAI, Google, Anthropic, and Meta have focused on improving reasoning ability, scaling model size, and expanding context windows. Much of the discussion around AI progress has centered on benchmarks and capabilities in text.
As these models become more capable, another area of development has gained attention: voice interfaces. Instead of focusing only on text generation, some researchers are working on systems that integrate language and speech more directly.
Hume AI is one of the companies exploring this direction.
Its work has largely focused on voice-based interaction and on systems that can generate or interpret speech with more nuance. As part of this effort, the company recently released an open-source model called TADA, short for Text Audio Dual Alignment.
The project is intended as a research contribution for building speech systems that integrate language and audio generation within a single architecture.
Hume AI (@hume_ai) announced the release on X on March 10, 2026:

"Today we're releasing our first open source TTS model, TADA! TADA (Text Audio Dual Alignment) is a speech-language model that generates text and audio in one synchronized stream to reduce token-level hallucinations and improve latency. This means: → Zero content hallucinations…"
Many existing text-to-speech systems treat text generation and speech synthesis as separate steps.
A language model first produces text, and a second system converts that text into audio. Some recent approaches attempt to combine these processes using language-model-style architectures, but they often face a structural imbalance between text tokens and audio tokens. Audio requires far more tokens per second than text, which increases sequence length and computational cost during generation.
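A rough back-of-the-envelope comparison makes the imbalance concrete. The rates below are illustrative assumptions, not figures from the TADA release: neural audio codecs commonly emit tens of discrete tokens per second, while speech corresponds to only a few text tokens per second.

```python
# Illustration of the text/audio token imbalance in LLM-style TTS.
# Both rates are assumptions chosen for illustration, not TADA's numbers.
AUDIO_TOKENS_PER_SEC = 50   # assumed discrete audio codec rate
TEXT_TOKENS_PER_SEC = 3     # assumed speaking rate in subword tokens

def sequence_lengths(seconds_of_speech: float) -> tuple[int, int]:
    """Return (text_tokens, audio_tokens) needed for a given duration."""
    return (
        round(TEXT_TOKENS_PER_SEC * seconds_of_speech),
        round(AUDIO_TOKENS_PER_SEC * seconds_of_speech),
    )

text_len, audio_len = sequence_lengths(60)  # one minute of speech
print(f"text tokens:  {text_len}")   # ~180
print(f"audio tokens: {audio_len}")  # ~3000, dominating the sequence
```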
TADA addresses this mismatch by aligning text tokens with acoustic representations in a one-to-one format.
Instead of generating many audio tokens for each piece of text, the model produces a single acoustic vector aligned with each text token. Text and speech therefore progress together in the same sequence.
This approach reduces the number of tokens required to represent speech and simplifies the generation process within the model.
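The sketch below shows what such an aligned decoding loop could look like. It is a toy paraphrase of the description above, not Hume's implementation; the stand-in model function and the 8-dimensional acoustic vector are invented for illustration.

```python
# Toy sketch of one-to-one text/acoustic generation (not the real TADA code).
# At every step the model emits a text token plus a single acoustic vector,
# so the two streams advance in lockstep instead of audio dominating.
import random

def fake_model_step(history):
    """Stand-in for the model: returns (text_token, acoustic_vector)."""
    vocab = ["hello", "world", "<eos>"]
    text_token = vocab[min(len(history), 2)]
    acoustic_vector = [random.random() for _ in range(8)]  # toy 8-dim vector
    return text_token, acoustic_vector

def generate(max_steps: int = 16):
    history = []
    while len(history) < max_steps:
        text_token, acoustic = fake_model_step(history)
        history.append((text_token, acoustic))  # one aligned pair per step
        if text_token == "<eos>":
            break
    transcript = [t for t, _ in history if t != "<eos>"]
    audio_frames = [a for _, a in history]
    return transcript, audio_frames

transcript, frames = generate()
print(transcript)   # ['hello', 'world']
print(len(frames))  # one acoustic vector per generated token
```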
Reducing the token count affects both speed and context length.
When fewer tokens are required to represent audio, the model can generate speech more efficiently and handle longer outputs within the same context window.
According to the project documentation, this allows speech sequences that extend significantly longer than those in many existing LLM-based text-to-speech systems, which often struggle with long audio generation due to token limits.
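A quick calculation shows why this matters for context length. The window size and token rates here are illustrative assumptions, not specifications from the release:

```python
# Illustrative context-window arithmetic (assumed rates, not TADA's specs).
CONTEXT_WINDOW = 4096        # assumed context length in tokens
AUDIO_TOKENS_PER_SEC = 50    # assumed discrete-codec rate
ALIGNED_PAIRS_PER_SEC = 3    # one text-token/acoustic-vector pair per token

codec_seconds = CONTEXT_WINDOW / AUDIO_TOKENS_PER_SEC
aligned_seconds = CONTEXT_WINDOW / ALIGNED_PAIRS_PER_SEC

print(f"codec-token TTS: ~{codec_seconds:.0f} s of speech per window")    # ~82 s
print(f"aligned 1:1 TTS: ~{aligned_seconds:.0f} s of speech per window")  # ~1365 s
```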
Another aspect of the design is the reduction of inconsistencies between generated text and speech. In some speech generation pipelines, the text content and the produced audio can diverge, resulting in skipped or repeated words. By aligning text and audio tokens directly, the model aims to maintain consistency between what is written and what is spoken during generation.
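The difference is easiest to see in a toy contrast. The example below is purely illustrative and not from the TADA codebase; it compares a cascaded pipeline, where the spoken output can drift from the text, with an aligned stream, where the transcript falls out of the generated pairs by construction.

```python
# Toy contrast (not real model code): cascaded vs. aligned generation.
text = ["the", "quick", "brown", "fox"]

# Cascaded pipeline: a separate synthesis stage can skip or repeat words.
synthesized_words = ["the", "quick", "fox"]  # "brown" was dropped
assert synthesized_words != text             # divergence is possible

# Aligned stream: each acoustic frame is emitted with its text token,
# so the spoken transcript is the text channel itself.
aligned = [(tok, f"<acoustic:{tok}>") for tok in text]
transcript = [tok for tok, _ in aligned]
assert transcript == text                    # consistent by construction
```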
The announcement also links to a blog post with further details: https://t.co/KqNp4jD0lV
The open-source release includes two models: a smaller English model and a larger multilingual model. Both are built on top of the Llama architecture and include the components needed to run the system, such as the tokenizer and audio codec. The models and related tools are available through the project’s repository and the Hugging Face model collection.
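For readers who want to try the models, usage will roughly follow the standard Hugging Face pattern sketched below. The repository ID and the exact loading interface are assumptions for illustration; the project's repository and model cards are the authoritative reference.

```python
# Hypothetical usage sketch: the repo id "hume-ai/tada-en" and a
# transformers-compatible interface are assumptions, not confirmed details.
# Consult the project's repository for the actual loading code, tokenizer,
# and audio codec setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "hume-ai/tada-en"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

inputs = tokenizer("Hello from a unified speech-language model.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)

# Decoding the paired acoustic stream into a waveform would go through the
# released audio codec, which is outside the scope of this sketch.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```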
The release contributes to ongoing research on unified speech-language models.
While much of the AI ecosystem continues to focus on text-based capabilities, work like TADA explores how language models can generate and process speech within the same framework.