
YouTube has used algorithms to automatically caption speech for years, and thanks to Google's machine learning advances, the system has become better at transcribing spoken words in videos.
On March 24th, 2017, YouTube extended that experience, using AI to recognize ambient sounds such as laughter, applause and music.
At launch, automatic sound captioning is restricted to just these three sounds.
According to Google, these are the sounds that video producers most often caption by hand. "These were among the most frequent manually captioned sounds, and they can add meaningful context for viewers who are deaf and hard of hearing," the company wrote.
In the announcement, Google engineer Sourish Chaudhuri explained how the system works:

YouTube's sound captioning system uses a deep neural network (DNN) model. To get the best results, the team behind it trained the network on "thousands of hours of videos" using a set of weakly labeled data.
Whenever a new video is uploaded to YouTube, the DNN runs and tries to identify any ambient sounds the video contains. The team achieved this by using a modified Viterbi algorithm. Then, to create the automatic captions, Google uses machine learning to pick out those sounds and display them as text.
According to the team, the most difficult part of the job for the AI was separating and displaying events that tend to happen at the same time, such as laughter and applause.
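
To illustrate how a Viterbi-style pass can turn noisy per-frame predictions into caption segments, here is a minimal sketch in Python. It is not Google's implementation: the announcement does not spell out how the Viterbi algorithm was modified, so this uses a plain two-state (off/on) Viterbi decoder. The function names, the `stay` parameter and the sample probabilities are all assumptions made up for the example. Each sound is decoded independently, which is one simple way to let laughter and applause overlap in time.

```python
import math


def viterbi_smooth(probs, stay=0.9):
    """Two-state (0 = off, 1 = on) Viterbi decoding for one sound class.

    probs: per-frame probabilities that the sound is present, as they might
           come from a classifier (hypothetical input, not YouTube's data).
    stay:  assumed probability of keeping the same state between frames.
    Returns the most likely 0/1 label for every frame.
    """
    if not probs:
        return []
    eps = 1e-9
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)

    def emit(p):
        # Log-likelihood of the frame under (off, on); clamp to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        return (math.log(1.0 - p), math.log(p))

    score = list(emit(probs[0]))  # uniform prior over the two states
    back = []                     # back-pointers, one pair per later frame
    for p in probs[1:]:
        e = emit(p)
        new_score, ptr = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            stay_score = score[s] + log_stay
            switch_score = score[1 - s] + log_switch
            if stay_score >= switch_score:
                new_score[s], ptr[s] = stay_score + e[s], s
            else:
                new_score[s], ptr[s] = switch_score + e[s], 1 - s
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def to_segments(labels, name):
    """Collapse 0/1 frame labels into (name, start_frame, end_frame) spans."""
    segments, start = [], None
    for i, label in enumerate(labels):
        if label and start is None:
            start = i
        elif not label and start is not None:
            segments.append((name, start, i))
            start = None
    if start is not None:
        segments.append((name, start, len(labels)))
    return segments


# Made-up classifier scores for eight frames of audio.
frames = {
    "LAUGHTER": [0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1],
    "APPLAUSE": [0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.1, 0.1],
}
captions = [
    segment
    for name, probs in frames.items()
    for segment in to_segments(viterbi_smooth(probs), name)
]
print(captions)
```

In this toy input, the single low-confidence laughter frame is absorbed into one continuous laughter segment, because switching off and on again costs more than one unlikely frame. Applause is decoded separately, so its segment can overlap the laughter.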

While automatic captioning was rough at first, machine learning has improved it over time, bringing it "closer and closer to human transcription error rates." But speech is just one part of a video's audio, and YouTube is seen as among the first to automate sound effect captioning.
YouTube is also aiming to add more common sounds, such as barking, knocking and ringing. Those sounds will pose new challenges for the algorithms, as the AI will need to figure out if a ringing sound is coming from an alarm, phone or doorbell, for example.
Google said that two-thirds of participants in a study found that sound effect captions enhanced their video experience. And while the system is prone to mistakes, Google thinks users won't see the occasional error as outweighing the benefits.