Background

This 'Octave' AI Becomes The First Large Language Model Made For Text-To-Speech

Text-to-speech (TTS) technology has come a long way.

Taking shape in the 1990s and early 2000s, the technology processes text using Natural Language Processing (NLP), converts it into phonemes, synthesizes speech using AI models, and generates human-like audio output for various applications.
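
To make those stages concrete, here is a minimal sketch of a conventional pipeline: normalize the text, map words to phonemes, then hand the phonemes to a synthesizer. Everything in it is an assumption made for illustration. The lexicon, the digit table, and the function names are hypothetical, and the synthesis step is a placeholder that returns bytes rather than real audio. It is not how Octave or any specific product works.

```python
import re

# Hypothetical toy lexicon (ARPAbet-style symbols). A real system would use a
# full pronunciation dictionary or a learned grapheme-to-phoneme model.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "two": ["T", "UW"],
}

# Hypothetical digit expansion table used by the normalization step.
DIGIT_WORDS = {"2": "two"}


def normalize(text: str) -> list[str]:
    """Text-processing step: lowercase, split into words, expand digits."""
    tokens = re.findall(r"[a-z]+|\d", text.lower())
    return [DIGIT_WORDS.get(tok, tok) for tok in tokens]


def to_phonemes(words: list[str]) -> list[str]:
    """Phoneme step: look each word up; unknown words are marked, not guessed."""
    phonemes: list[str] = []
    for word in words:
        phonemes.extend(TOY_LEXICON.get(word, [f"<unk:{word}>"]))
    return phonemes


def synthesize(phonemes: list[str]) -> bytes:
    """Synthesis step: placeholder only. A real TTS model would return audio."""
    return " ".join(phonemes).encode("utf-8")


audio = synthesize(to_phonemes(normalize("Hello, world 2")))
```

Each stage in a pipeline like this sees only the output of the stage before it, so the synthesizer receives sound symbols with no knowledge of what the sentence means.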

Although the technology uses AI, TTS was born long before AI became an internet buzzword.

Since OpenAI kicked off the AI arms race with the introduction of ChatGPT, nearly every industry has learned about Large Language Models (LLMs) and how they can power chatbots that sound convincingly humanlike.

None of those AIs, however, is a Large Language Model purpose-built for voice synthesis.

This is where Hume AI wants to be the first, with what it calls 'Octave.'

The idea is to give AI emotions, or at least to make it sound as if it has them.

In that respect, Octave is really convincing.

Hume said that Octave is "the first text-to-speech system that understands what it’s saying," and in a blog post it provides a number of examples in which Octave can "generate more natural-sounding, context-aware speech."

"Today we’re launching Octave (Omni-capable text and voice engine), the first LLM for text-to-speech. Unlike conventional TTS that merely “reads” words, Octave is a speech-language model that understands what words mean in context, unlocking a new level of expressiveness and nuance—and new AI voice capabilities. It acts out characters, generates voices from prompts, and takes instructions to modify the emotion and style of a given utterance."

To make this happen, Hume uses a state-of-the-art LLM that has been trained to understand and synthesize speech.

Trained on 1000x more language than traditional TTS, Octave is able to understand scripts like a human actor, and with that in mind, it can deliver realistic-sounding emotions, sarcasm, pace, word emphasis, and more.

Because its LLM foundation lets it understand things like plot twists, emotional cues, and character traits, and how to combine them, Octave can read love letters tenderly and deliver sports announcements energetically.

In other words, this speech-language model can predict the tune, rhythm, and timbre of speech, inferring when to whisper secrets, shout triumphantly, or calmly explain a fact.

"Give directions like 'sound sarcastic' or 'whisper fearfully.' For the first time, creators have total control," said Hume.

The AI can also generate a voice from nothing more than a prompt or an "evocative script."

The LLM interprets the meaning and style of a script, including its pronouns, contractions, and vocabulary, to generate a coherent voice for the character.

Users who want more control can guide Octave further by prompting it with a description of the character, through the Voice Design feature.

According to Hume, this description can encompass any number of characteristics, from "patient, empathetic counselor with an ASMR voice" to "dramatic medieval knight" to "middle-aged, Hollywood movie trailer narrator."

The description can be nuanced, combining specific accents, demographics, occupational roles, and more. Hume offers a voice prompting guide with more information.

The goal is for Octave to work like a human actor who can read out scripts "with any instructed emotion or speaking style."

As for future plans, Hume wants Octave to be able to clone a voice from as little as five seconds of audio.

"We plan on launching Voice Cloning in the coming weeks," the team said.

The team is also working on safe ways to offer this capability.

"We’re continuing to train Octave and improve its capabilities. For this initial launch, we focused largely on English-language speech, but Octave can also speak Spanish fluently and we hope to improve its proficiency in other languages soon. We also expect to improve Octave’s core capabilities over the coming weeks. In particular, we remain focused on expressive speech generation, prompting for different emotions and styles, generating new voices, and smooth conversations among multiple speakers," the company added.

Published: 27/02/2025