Background

YouTube's Automatic Captioning System Uses AI To Understand Sound Effects

YouTube has used algorithms to automatically caption speech for years, and thanks to Google's machine learning advances, the system has become better at transcribing spoken words in videos.

On March 24, 2017, YouTube improved that experience by using AI to recognize ambient sounds such as laughter, applause and music.

At launch, automatic sound captioning is limited to just those three sounds.

According to Google, these are the sounds that most video producers manually caption. "These were among the most frequent manually captioned sounds, and they can add meaningful context for viewers who are deaf and hard of hearing," wrote the company.

In the announcement, Google engineer Sourish Chaudhuri explained that:

"While the sound space is obviously far richer and provides even more contextually relevant information than these three classes, the semantic information conveyed by these sound effects in the caption track is relatively unambiguous, as opposed to sounds like [RING] which raises the question of 'what was it that rang – a bell, an alarm, a phone?'"

YouTube's sound captioning system uses a deep neural network (DNN) model. The team behind it trained the network on "thousands of hours of videos" with weakly labeled data to get the best results.

Whenever a new video is uploaded to YouTube, the DNN runs over its audio and tries to identify any of these ambient sounds. To pin down where each sound starts and ends, the team uses a modified Viterbi algorithm. The detected sounds are then displayed as text in the automatic captions.
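To give a rough idea of what Viterbi-style smoothing can look like, here is a minimal Python sketch. It is not Google's implementation: it assumes the network emits one probability per short audio segment for a single sound class, and it decodes the most likely sequence of two states, OFF (0) and ON (1). The function names and the `stay` parameter are hypothetical and chosen only for illustration.

```python
import math


def smooth_on_off(probs, stay=0.9):
    """Two-state Viterbi decoding over per-segment probabilities.

    probs: assumed per-segment probabilities that the sound is present.
    stay:  assumed probability of remaining in the same state between
           neighbouring segments (higher values smooth more aggressively).
    Returns a list of 0 (OFF) / 1 (ON) states, one per segment.
    """
    if not probs:
        return []
    eps = 1e-9
    log_stay = math.log(stay)
    log_switch = math.log(1.0 - stay)

    def emit(p):
        # Log-likelihood of the observation under OFF and ON.
        p = min(max(p, eps), 1.0 - eps)
        return (math.log(1.0 - p), math.log(p))

    score = list(emit(probs[0]))  # best log-score ending in OFF / ON
    back = []                     # back-pointers for each later segment
    for p in probs[1:]:
        e = emit(p)
        new_score = [0.0, 0.0]
        ptr = [0, 0]
        for s in (0, 1):
            cands = [
                score[prev] + (log_stay if prev == s else log_switch)
                for prev in (0, 1)
            ]
            ptr[s] = 0 if cands[0] >= cands[1] else 1
            new_score[s] = cands[ptr[s]] + e[s]
        back.append(ptr)
        score = new_score

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def on_segments(path):
    """Convert a 0/1 path into (start, end) index pairs, end exclusive."""
    segments, start = [], None
    for i, s in enumerate(path):
        if s == 1 and start is None:
            start = i
        elif s == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(path)))
    return segments


scores = [0.05, 0.05, 0.95, 0.4, 0.95, 0.95, 0.05, 0.05]
print(on_segments(smooth_on_off(scores)))  # [(2, 6)]
```

A plain 0.5 threshold would split this example into two short detections because of the brief dip to 0.4. The transition penalty keeps it as one continuous sound, which is the kind of stability a caption track needs.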

According to the team, the most difficult part of the job for the AI was separating and displaying events that tend to happen at the same time, such as laughter and applause.
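One way to picture the display side of that problem is the Python sketch below. It is purely illustrative and makes several assumptions: each sound class has already been detected independently as a list of (start, end) times, and overlapping classes are shown together in one caption cue. The function name `caption_cues` and the bracketed tag format are hypothetical, not YouTube's actual pipeline.

```python
def caption_cues(segments_by_class):
    """Merge per-class (start, end) segments into caption cues.

    segments_by_class: assumed mapping from a class name to its detected
    (start, end) segments. Each returned cue is (start, end, text), where
    text lists every class active during that interval.
    """
    # Every segment boundary is a point where the set of active sounds may change.
    boundaries = sorted(
        {t for segs in segments_by_class.values() for seg in segs for t in seg}
    )
    cues = []
    for start, end in zip(boundaries, boundaries[1:]):
        active = sorted(
            name
            for name, segs in segments_by_class.items()
            if any(s <= start and end <= e for s, e in segs)
        )
        if not active:
            continue
        text = " ".join(f"[{name.upper()}]" for name in active)
        if cues and cues[-1][1] == start and cues[-1][2] == text:
            # Same tags as the previous adjacent cue: extend it.
            cues[-1] = (cues[-1][0], end, text)
        else:
            cues.append((start, end, text))
    return cues


detected = {"applause": [(0, 4)], "laughter": [(2, 6)]}
for cue in caption_cues(detected):
    print(cue)
# (0, 2, '[APPLAUSE]')
# (2, 4, '[APPLAUSE] [LAUGHTER]')
# (4, 6, '[LAUGHTER]')
```

Treating each class as its own independent detection is what allows two sounds to be active at once. A single "pick the most likely sound" decision would have to drop one of them.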

Automatic speech captioning was rough at first, but thanks to machine learning its quality has improved over time, getting "closer and closer to human transcription error rates." Speech is only one part of a video's audio, however, and YouTube is seen as among the first to automate sound effect captioning.

YouTube is also aiming to add more common sounds, such as barking, knocking and ringing. Those sounds will pose new challenges for the algorithms, as the AI will need to figure out if a ringing sound is coming from an alarm, phone or doorbell, for example.

Google said that two-thirds of participants in a study found that sound effect captions enhance their video experience. And while the system is prone to mistakes, Google believes that users will not see the occasional error as something that detracts from the benefits.