'Underspecification' Is One Of AI's Biggest Weaknesses, Says Google

Artificial Intelligence products are getting better and better. In many respects, they have become almost as good as humans, and sometimes even better, across a wide range of abilities.

That is because their creators have become better at teaching them. With more capable computer hardware, an abundance of data, and knowledge built on earlier work, researchers have managed to train increasingly capable AI products.

However, trained machines can still make mistakes that humans would never fall for.

For example, using a method called an adversarial attack, a small change to an image, often unnoticed or not even visible to humans, can force image-recognition AIs to misinterpret what they see.

This is why researchers in the AI field are working hard to understand the limitations of machine learning, as well as the so-called black box inside these models.

For its part, Google is the titan of the web. But the company has gone far beyond search, branching into many fields of research, including AI.

On the web, Google is one of the biggest adopters of the technology. Even its CEO said that AI is more "profound" than fire or electricity.


Its researchers have shared a report about what they think is one of the biggest weaknesses at the heart of machine learning, one that may lie behind many of AI's problems in real life.

According to the researchers, one of the primary reasons that machine learning models often perform quite differently in the real world than they do during testing and development is underspecification.

The analogy is a system of linear equations with more unknowns than equations: many different solutions fit the training data equally well, and this excess of possibilities can lead to differing behavior across networks trained on the same dataset.
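
To make the analogy concrete, here is a minimal sketch in plain NumPy (the numbers are made up for illustration): two equations in three unknowns have infinitely many solutions, all of which reproduce the "training" equations exactly, yet disagree on a new input.

```python
import numpy as np

# Two "training" equations in three unknowns: A @ w = b is underdetermined.
A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]])
b = np.array([4.0, 5.0])

# One particular solution: the minimum-norm solution from least squares.
w1, *_ = np.linalg.lstsq(A, b, rcond=None)

# Another equally valid solution: add any vector from the null space of A.
null_dir = np.array([5.0, -3.0, 1.0])   # A @ null_dir == 0
w2 = w1 + 2.0 * null_dir

print(A @ w1, A @ w2)          # both reproduce b -> both "pass training"
x_new = np.array([1.0, 1.0, 1.0])
print(x_new @ w1, x_new @ w2)  # but they predict different values on new data
```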

Put another way, underspecification is a statistical concept describing situations in which observed phenomena have many possible causes, not all of which are accounted for by the AI model.

AIs that fail in the lab under controlled testing can have their algorithms tweaked. But once they have been unleashed into the real world, failure to operate as intended should raise questions.

Is there any mismatch between training/development and real-world performance?

The answer is yes. One of the most common reasons that AI models fail at real-world tasks is a phenomenon known as data shift: a fundamental difference between the kind of data used to develop a machine learning model in the lab and the data fed into the model once it is applied in the real world.

As an example, computer vision models trained on high-quality image data can struggle when fed data captured by the low-quality cameras found in day-to-day environments.
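
As a rough illustration of data shift (a sketch, not taken from the Google paper), the snippet below assumes scikit-learn is available and uses its small bundled digits dataset: a classifier trained on clean images loses accuracy when the same kind of images arrive corrupted by noise, much as they might from a low-quality camera.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=2000).fit(X_train, y_train)

# Clean test data: roughly matches the training distribution.
print("clean accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Simulated "low-quality camera": add pixel noise to the same test images.
rng = np.random.default_rng(0)
X_shifted = X_test + rng.normal(0.0, 4.0, size=X_test.shape)
print("shifted accuracy:", accuracy_score(y_test, model.predict(X_shifted)))
```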

Read: Paving The Roads To Artificial Intelligence: It's Either Us, Or Them

Underspecification in a simple epidemiological model. (Credit: A. D’Amour, K. Heller, D. Moldovan, et al.)

The typical method of training an AI involves feeding the machine learning model a large amount of data that it can analyze and extract relevant patterns from. This way, the model learns to spot features and make predictions.

After that, the model needs to be fed examples it has never seen before, and asked to predict the nature of those examples based on what it has learned.

Once the model has achieved a certain level of accuracy, the training is usually considered complete.
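
In code, that conventional workflow often looks something like the sketch below; the SGD classifier, the digits dataset, and the 0.9 accuracy threshold are arbitrary choices for illustration, not anything prescribed by the Google team.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.0                                   # scale pixel values to [0, 1]
X_train, X_heldout, y_train, y_heldout = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y_train)

# Train in passes over the data, checking held-out examples the model has
# never seen; stop once accuracy clears a threshold -- the point where
# training is usually declared "complete".
for epoch in range(50):
    model.partial_fit(X_train, y_train, classes=classes)
    acc = model.score(X_heldout, y_heldout)
    if acc >= 0.9:
        print(f"epoch {epoch}: held-out accuracy {acc:.3f} -- good enough, stop")
        break
```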

But this is where the training falls short.

According to the Google research team, more needs to be done to ensure that models can truly handle things beyond their training data.

Usually, training methods produce models that may all pass their tests, but that differ in many small ways that seem insignificant. Those seemingly insignificant differences shouldn't be taken lightly.

Different nodes in the models will have different random values assigned to them. Training data could also be selected or represented in different ways. These variations are small and often arbitrary. They are often overlooked when they don’t have a huge impact on how the models perform during training.

However, when the impact of all these small changes accumulates, it can lead to major variations in real-world performance.
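
A toy way to see this accumulation (again an illustration, not the paper's actual experiment) is to train several copies of the same model on the same data, changing only the random seed. In the sketch below, all copies score roughly the same on the standard test set, yet they can disagree noticeably once the inputs shift.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.0                                   # scale pixel values to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Simulated distribution shift: noisy versions of the same test images.
rng = np.random.default_rng(0)
X_shifted = X_test + rng.normal(0.0, 0.25, size=X_test.shape)

# Identical architecture, identical data -- only the random seed differs.
for seed in range(5):
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                          random_state=seed).fit(X_train, y_train)
    print(f"seed {seed}: test acc {model.score(X_test, y_test):.3f}, "
          f"shifted acc {model.score(X_shifted, y_test):.3f}")
```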

This underspecification is problematic because it means that even if the training process is capable of producing good models, it can also produce poor ones, and the difference wouldn't be discovered until the model had left development and entered real-world use.

Stress test performance varies across identically trained medical imaging models. (Credit: A. D’Amour, K. Heller, D. Moldovan, et al.)
Identically trained retinal imaging models show systematically different behavior on stress tests. (Credit: A. D’Amour, K. Heller, D. Moldovan, et al.)

If these issues can be spotted while the AI is still under development, there are ways to address the underspecification problem. At this stage, the creators of the AI can revise their machine learning protocols and retest the models for the shortcomings that could surface in the real world.

One option is to use stress tests to see how well a model performs on real-world data and to pick up potential problems. According to Alex D’Amour, the Google researcher who led the study, machine learning researchers and engineers need to be doing a lot more stress testing before releasing models into the wild.

“Designing stress tests that are well-matched to applied requirements, and that provide good ‘coverage’ of potential failure modes is a major challenge,” said the team.

However, this requires a good understanding of the way the model can go wrong.

This can be hard to do, given that stress tests need to be tailored to specific tasks using real-world data, which can be difficult to come by for certain tasks and contexts.
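
In spirit, though, a stress test can be as simple as evaluating a trained model under a battery of conditions it may meet after deployment. The sketch below is a hypothetical helper (the noise-based perturbations are stand-ins for whatever conditions a real task demands) that reports accuracy per condition so weak spots stand out; it assumes a fitted scikit-learn-style estimator with a `.score()` method, such as the models in the earlier sketches.

```python
import numpy as np

def stress_test(model, X, y, noise_levels=(0.0, 0.1, 0.25, 0.5), seed=0):
    """Report accuracy under increasingly corrupted inputs.

    `model` is any fitted estimator with a .score(X, y) method
    (e.g. from scikit-learn); the noise levels are arbitrary examples
    standing in for task-specific stress conditions.
    """
    rng = np.random.default_rng(seed)
    report = {}
    for level in noise_levels:
        X_perturbed = X + rng.normal(0.0, level, size=X.shape)
        report[f"noise={level}"] = model.score(X_perturbed, y)
    return report

# Example usage with a model and test split like those trained above:
# for condition, acc in stress_test(model, X_test, y_test).items():
#     print(condition, round(acc, 3))
```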

“We need to get better at specifying exactly what our requirements are for our models. Because often what ends up happening is that we discover these requirements only after the model has failed out in the world,” explained D’Amour to MIT Technology Review.

What this means is that the limits on how much machine learning predictions can be trusted should force researchers to rethink certain AI applications.

Special attention is needed where AI is part of a system that involves human lives, such as self-driving cars and medical imaging.

In these scenarios, relatively small flaws in machine learning capabilities could have life and death implications.

Further reading: Technology And Artificial Intelligence, And Where They Fall Short