Background

To Beat OpenAI, Google Expands Its AI Capabilities With A 1,000-Language Model

Google USM

The AI race is on, and tech companies are using whatever method they can to build their own AI armies.

Google is the tech giant of the web and beyond. And this time, following OpenAI's release of ChatGPT, Google is keen on building a language model that can understand more languages than anything in existence.

In an update, the company gave more information about its Universal Speech Model (USM), calling it a “critical first step” in achieving its plans.

It all began in November 2022, when Google unveiled the 1,000 Languages Initiative, a bold and ambitious pledge to develop a machine learning model that supports the world's 1,000 most-spoken languages, enabling billions of people to experience inclusion.

USM is essentially a "family of state-of-the-art speech models" with two billion parameters, trained on 12 million hours of speech and 28 billion sentences of text spanning more than 300 languages.

YouTube, for example, already uses USM to generate closed captions.

The overall training pipeline of Google's USM. (Credit: Google)

The AI can also perform automatic speech recognition (ASR), with the ability to automatically detect and translate widely-spoken languages like English and Mandarin, as well as under-resourced languages like Amharic, Cebuano, Assamese, and Azerbaijani.

In all, USM can perform ASR across more than 100 languages, with a word error rate (WER) of less than 30%.

This is notable because it is lower than the WER of OpenAI's Whisper (large-v2) on the YouTube Captions test set.
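For readers unfamiliar with the metric, WER counts the word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, then divides by the number of words in the reference. The following is a minimal Python sketch of that standard definition. The function name and the sample sentences are illustrative only; they are not taken from Google's or OpenAI's evaluation code, which may normalize text differently before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), not any vendor's scorer.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("fox" -> "dog") and one insertion ("high")
# against a five-word reference.
print(word_error_rate("the quick brown fox jumps",
                      "the quick brown dog jumps high"))  # 0.4
```

By this measure, a WER below 30% means fewer than 30 word-level errors for every 100 words in the reference. Because insertions count as errors, WER can exceed 100% on very poor transcripts.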

Google's USM supports all 73 languages in the YouTube Captions' Test Set, and outperforms Whisper on the languages it can support with lower than 40% WER. Lower WER is better. (Credit: Google)

According to Google in a blog post:

"The development of USM is a critical effort towards realizing Google’s mission to organize the world’s information and make it universally accessible. We believe USM’s base model architecture and training pipeline comprise a foundation on which we can build to expand speech modeling to the next 1,000 languages."

It's also worth noting that Google's AI model achieved a lower WER than Whisper both with and without training on in-domain data.

Comparison of Google USM (with or without in-domain data) and Whisper results on ASR benchmarks. Lower WER is better. (Credit: Google)

As for speech translation, Google's AI model, which includes text via the second stage of its pipeline, achieves "state-of-the-art quality with limited supervised data."

The BLEU score the model achieved is higher than that of OpenAI's Whisper in all segments.

CoVoST BLEU score. Higher BLEU is better. (Credit: Google)

Google managed to do all this by pre-training the encoder of the model on a large unlabeled multilingual dataset. The company then fine-tuned the model on a smaller set of labeled data, combining self-supervised learning with supervised fine-tuning so that the model can recognize under-represented languages.

This approach lets the model adapt effectively to new languages and data.

Google also said that its USM uses the encoder-decoder architecture, where the decoder can be CTC, RNN-T, or LAS.
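Of the three decoder options, CTC (connectionist temporal classification) is the simplest to illustrate. The encoder emits a score for every symbol at every audio frame, including a special "blank" symbol, and the transcript is recovered by taking the best symbol per frame, merging consecutive repeats, and dropping blanks. The sketch below shows this greedy form of CTC decoding in plain Python. The vocabulary, the blank index, the scores, and the function name are all made up for illustration; this is not Google's implementation, and production systems often use beam search instead.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is reserved for the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding over per-frame symbol scores.

    frame_scores: one list of scores per audio frame, one score per symbol.
    vocab: list mapping symbol index to its text (hypothetical toy vocabulary).
    """
    # 1. Pick the highest-scoring symbol index in each frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2. Merge runs of the same index into a single occurrence.
    collapsed = [index for index, _ in groupby(best_path)]
    # 3. Remove blanks, which only separate symbols and carry no text.
    return "".join(vocab[index] for index in collapsed if index != blank)


# Toy example with made-up scores: the per-frame best path is c, c, blank, a, t.
vocab = ["_", "c", "a", "t"]
frames = [
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]
print(ctc_greedy_decode(frames, vocab))  # cat
```

The blank symbol is what allows genuinely doubled letters: a best path of "a, blank, a" decodes to "aa", while "a, a" collapses to a single "a".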

As the AI race becomes more intense, thanks to OpenAI, Google is striving hard to fulfill its mission of organizing the world’s information and making it universally accessible.

And USM can help Google achieve that goal.

Google plans to use the base model architecture and training pipeline of USM as its foundation to expand speech modeling to the next 1,000 languages.

Published: 12/03/2023