Google's AI Watched Thousands Of Hours Of TV To Read Lips Better Than Humans

Teaching machines how the world works isn't something new. With artificial intelligence (AI) and machine learning, technology has advanced to the point where machines can do things that were previously unique to humans.

Google is one of the tech companies focused on the subject, and its researchers at DeepMind have collaborated with scientists from the University of Oxford to develop advanced lip-reading software. The researchers claim it is the most accurate to date, and that it can read lips probably better than humans can.

To make this happen, the team used thousands of hours of TV footage from the BBC to train the neural network. As a result, the machine can annotate the footage with 46.8 percent accuracy.

That number is far from perfect, and far from the accuracy AI can achieve when transcribing audio. Microsoft, for example, announced that its voice-recognition AI was able to understand speech as well as humans. But reading the movement of lips is not an easy task for computers, or even for humans.

By comparison, a professional human lip-reader working from the same footage got the right words only 12.4 percent of the time.

This research follows the work of a separate group, also from Oxford, which was released earlier in November. Using similar techniques, those scientists created a lip-reading program called LipNet, which reached 93.4 percent accuracy, compared to 52.3 percent for humans. However, LipNet was only tested on specially recorded footage in which volunteers spoke sentences in a formulaic manner.

DeepMind's lip-reading software, called "Watch, Listen, Attend, and Spell", was tested on far more challenging footage: natural, unscripted conversations from the BBC's politics shows.

DeepMind's software "watched" more than 5,000 hours of footage from TV shows such as Newsnight, Question Time and The World Today, learning to tell apart sentences built from 17,500 unique words.

By comparison, LipNet read a total of 51 unique words.

The possibilities for such an AI advance could be huge, particularly as the basis for applications. For example, it could help hearing-impaired people understand conversations, annotate silent films, or even control digital personal assistants such as Siri by just mouthing words to a smartphone's camera, which could be handy in noisy environments.

While future progress is certainly possible, people are concerned about the technology's potential for misuse. Lip-reading AIs could be used to aid surveillance through cameras or CCTV, for example. But the researchers who worked on it said that there is still a big difference between transcribing brightly lit, high-resolution TV footage and transcribing grainy amateur footage or CCTV with a lower frame rate.

But still, AI has made another leap.
