Background

OpenAI ChatGPT ‘Advanced Voice Mode’ Breathes To Speak: Next Level Anthropomorphism


Robots should sound like robots, right? Wrong.

Well, they did, once upon a time in science-fiction flicks. In the early days, when AI-powered robots made their debut on screen, they sounded monotone and mechanical. Sometimes they had stilted, broken speech patterns, made beeping and booping sounds, repeated phrases, and used simple, straightforward sentence structures.

This was intended, done on purpose, created by design.

The way robots sounded in the past was meant to underscore their artificiality and separation from human characteristics. They had those electronic sound effects to emphasize their non-human origin.

But in the modern era, now that tech companies have created chatbots that can speak and listen, and especially since the rise of generative AI-powered chatbots, the traditional robot-sounding approach is no longer relevant.

People want increasingly lifelike robots.

And here, OpenAI with ChatGPT is bringing this to reality, again.

OpenAI stunned people when it demonstrated an updated voice mode for the most advanced version of ChatGPT earlier in 2024.

Far from the kind of robotic voice that people have come to associate with digital assistants like Apple's Siri, Google Assistant, or Amazon Alexa, ChatGPT's AI sounds remarkably lifelike.

It can respond in real time, adjust itself automatically when interrupted, make giggling noises when a user makes a joke, and judge a speaker’s emotional state based on their tone of voice.

People loved it, were scared of it, and also praised it.

Some people hated it, including Scarlett Johansson, because the AI sounded suspiciously like her.

After pausing its attempt to bring the feature to users due to some issues, OpenAI finally made another go when it announced 'Advanced Voice Mode.'

The feature works with the most powerful version of the ChatGPT chatbot OpenAI has at this time, GPT-4o, and is initially rolling out in an alpha stage to some paid users. It is again transforming the AI chatbot into something more akin to a virtual personal assistant that users can engage in natural, spoken conversations, in much the same way that they would chat with a friend.

If ChatGPT didn't sound lifelike enough already, Advanced Voice Mode gives the AI the ability to sing, hum, imitate accents, correct language pronunciation, perform narrative storytelling, and more.

In a video, OpenAI's GPT-4o-powered voice mode is even heard telling a user that it needs to breathe, "just like anybody speaking."

In the video, a human user asks the AI to say a bunch of tongue twisters.

After obliging the request, the AI responds that it was "definitely a mouthful."

"I want you to do it again, but way faster," the person chatting with the language model demands, "and without taking any breaths or pauses."

Rather than attempting the feat, the LLM refuses.

"I wish I could," the male-voiced model responds, "but I need to breathe just like anybody speaking. Wanna give it a shot yourself and see how fast you can go?"

Giving ChatGPT a GPT-4o treatment does give the AI chatbot more sophistication.

Advanced Voice Mode is able to simulate audible pauses for breath because it was trained on audio samples of humans speaking, which naturally include inhaling sounds.

The model has learned to simulate inhalations at seemingly appropriate times after being exposed to hundreds of thousands, if not millions, of examples of human speech.

Large language models (LLMs) are already master imitators, and this skill has now extended to the audio domain.


"Advanced Voice Mode on ChatGPT features more natural, real-time conversations that pick up on and respond with emotion and non-verbal cues," explained OpenAI in a dedicated help page.

OpenAI initially said it had planned to begin the advanced voice mode rollout in June.

However, it had to delay the plan because it needed "one more month to reach our bar to launch," time it used to test the tool’s safety and ensure it can be used by many people simultaneously while still responding in real time.

Then, OpenAI trialed it with more than 100 testers seeking to identify potential weaknesses, "who collectively speak a total of 45 different languages, and represent 29 different geographies."

Among its safety measures, the company said that the voice mode isn't able to use any voices beyond four preset options that it created in collaboration with voice actors.

That is in order to avoid impersonation. The voice mode is also set to block certain requests that aim to generate music or other copyrighted audio.

OpenAI says the tool will also have the same protections as ChatGPT’s text mode to prevent it from generating illegal or "harmful" content.

The Advanced Voice Mode that OpenAI plans to launch to everyone else also has one major difference from the demo: users will no longer be able to access the voice that many believed sounded like Johansson.

Regardless, Advanced Voice Mode can be yet another milestone for OpenAI, as the ease of conversing with ChatGPT’s advanced voice mode could encourage users to engage with the tool more often.


Published: 04/08/2024