Background

Baidu's ‘Deep Voice’ AI Can Clone Human Voice, In Just A Matter Of Seconds

The Baidu Deep Voice research team unveiled an AI that is capable of cloning a human voice back in 2017.

After a year of development and refinement, the company's text-to-speech system is able to generate synthetic human speech faster and more efficiently than ever before.

Before, the AI needed 30 minutes of training material. After the improvement, Deep Voice can do the same job in just a few seconds.

To improve the AI, Baidu used a two-pronged approach to build its neural cloning system:

  1. Speaker adaptation: Fine-tuning a multi-speaker generative model on a few cloning samples, using a backpropagation-based approach.
  2. Speaker encoding: Training a separate model that generates a speaker embedding directly from the cloning audio, and combining it with the multi-speaker generative model.
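
To make the difference between the two approaches listed above more concrete, here is a minimal toy sketch in Python. It is not Baidu's architecture: a fixed linear map (W_gen) stands in for the frozen multi-speaker generative model, a pseudo-inverse (W_enc) stands in for a trained speaker encoder, and every name in it (generate, adapt_embedding, encode_embedding) is a hypothetical one chosen for illustration. It only shows that adaptation reaches a speaker embedding through many gradient steps, while encoding produces one in a single forward pass.

```python
import numpy as np

EMB_DIM = 4

# Toy stand-in for a frozen multi-speaker generative model (assumption):
# a fixed linear map from a speaker embedding to an "audio feature" vector.
W_gen = np.vstack([np.eye(EMB_DIM), np.diag([0.5, 1.0, 1.5, 2.0])])


def generate(embedding):
    """Produce toy audio features for a given speaker embedding."""
    return W_gen @ embedding


def adapt_embedding(target_features, steps=200, lr=0.05):
    """Speaker adaptation: fit a new embedding by gradient descent.

    The generative model stays frozen; only the embedding is updated
    by backpropagating the squared reconstruction error.
    """
    emb = np.zeros(EMB_DIM)
    for _ in range(steps):
        error = generate(emb) - target_features
        grad = 2.0 * W_gen.T @ error  # gradient of ||error||^2 w.r.t. emb
        emb -= lr * grad
    return emb


# Toy stand-in for a trained speaker encoder (assumption): the
# pseudo-inverse of the generator, mapping features back to an embedding.
W_enc = np.linalg.pinv(W_gen)


def encode_embedding(target_features):
    """Speaker encoding: infer the embedding in one forward pass."""
    return W_enc @ target_features


# A "new speaker" the model has never seen, and one sample of their audio.
true_emb = np.array([0.5, -1.0, 2.0, 0.25])
sample = generate(true_emb)

adapted = adapt_embedding(sample)   # many optimization steps
encoded = encode_embedding(sample)  # a single matrix multiply

print("adaptation error:", np.abs(adapted - true_emb).max())
print("encoding error:  ", np.abs(encoded - true_emb).max())
```

In this toy setting both routes recover the same embedding. The trade-off the sketch hints at is that adaptation pays for an optimization loop per new speaker, whereas encoding shifts that cost into training the encoder ahead of time.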

Baidu’s research team used these techniques not only to speed up cloning but also to improve the AI system, which they expect will have noteworthy applications in personalizing human-machine interfaces.

Both speaker adaptation and speaker encoding require only minimal audio. Both can be integrated into the Deep Voice model along with speaker embeddings, and both deliver quality performance without compromising the quality of the source audio.

Text-to-speech technology is nothing new. Google, Amazon, Apple, Microsoft and others, for example, have made significant contributions to the field.

Baidu, however, claims that the technology can go beyond mainstream uses into fields such as healthcare. The company said that the technology can help people who have lost their voice, giving them the ability to communicate again. This goal is rather bold, since it remains to be seen whether the technology is advanced enough to do this yet.

This can be seen from early reviews, where people's reactions have been mixed.

While it has received some positive reviews, the execution was initially far from great. Below are the examples:

Original speech

Cloned speech (speaker embedding adaptation with 1 sample)

Cloned speech (speaker embedding adaptation with 50 samples)

Cloned speech (whole model adaptation with 100 samples)

For this reason, Baidu has asked for a few more months to perfect the technology. According to the researchers, the technology could be improved even further with tweaked algorithms and broader datasets.

The world has already seen the likes of Deepfake, the controversial AI that can swap a person's face onto another's body; Nvidia's AI, which is capable of creating fake human faces; and Google's AI, which is able to generate a voice indistinguishable from a human's.

We are closing in on a world where more and more things are fake, and we can no longer believe our eyes and ears.

Published: 27/09/2017