How Artificial Intelligence Can Be Tricked And Fooled By 'Adversarial Examples'

Machine learning algorithms can become sophisticated enough to exceed human performance on some tasks. Unlike humans, however, they are vulnerable to a class of inputs called "adversarial examples."

These are specially designed optical illusions that fool computers into making mistakes, such as misinterpreting a picture. Adversarial examples can be images, sounds, or even paragraphs of text.

They act like hallucinations for algorithms: by subtly altering the inputs to a machine learning model, they cause the model to make a mistake. Adversarial examples work across different mediums.

One example is the panda-gibbon mix-up.

Starting with an image of a panda, the attacker adds a small perturbation that has been calculated to make the image be recognized as a gibbon with high confidence.

Overlaid on the panda, the adversarial perturbation causes the classifier to miscategorize the panda as a gibbon.
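The mechanics of this attack can be sketched with the fast gradient sign method (FGSM), the technique behind the panda-gibbon example. The toy logistic-regression "model" below is an illustrative stand-in for a real image classifier: the weights and the "image" are random, not the actual panda, but the perturbation step is the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained weights and a flattened "image" with pixels in [0, 1].
w = rng.normal(size=64)      # stand-in for a trained network's parameters
b = 0.1
x = rng.uniform(size=64)     # the clean input, e.g. the panda

def predict(x):
    # Probability the model assigns to the wrong class ("gibbon").
    return sigmoid(w @ x + b)

def loss_grad(x, y):
    # Gradient of the cross-entropy loss w.r.t. the input; for logistic
    # regression this is simply (p - y) * w.
    return (predict(x) - y) * w

# FGSM: nudge every pixel by epsilon in the direction that increases the
# loss for the true label (y=0, "panda"), then clip back to valid pixels.
epsilon = 0.1
x_adv = np.clip(x + epsilon * np.sign(loss_grad(x, y=0)), 0.0, 1.0)

print("clean P(gibbon) =", predict(x))
print("adv   P(gibbon) =", predict(x_adv))
```

Because every pixel moves by at most epsilon, the perturbed image looks essentially identical to a human, yet the model's confidence in the wrong class goes up.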

Adversarial examples can also be printed out on standard paper and still fool systems.

In one demonstration, an adversarial example printed on normal paper and photographed with a standard-resolution smartphone still tricked the system into recognizing a washer as a safe.

Adversarial examples have the potential to be dangerous. For example, attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would misinterpret as a 'yield' or other sign, such as mistaking a stop sign for a speed limit sign.

In fact, people have been able to craft adversarial inputs that beat other kinds of algorithms, such as spam filters.

Previous adversarial examples have largely been designed in "white box" settings, where computer scientists have access to the underlying mechanics that power an algorithm. In these studies, the researchers knew how the system was trained, which made it easier to trick.

These kinds of adversarial examples are considered less threatening because they don't resemble real-world scenarios, where an attacker typically would not have access to a proprietary algorithm.

Reinforcement learning agents can also be manipulated by adversarial examples, according to research from UC Berkeley, OpenAI, and Pennsylvania State University. The research showed that widely used RL algorithms, such as DQN, TRPO, and A3C, are vulnerable to adversarial inputs.

Defenses Against Adversarial Examples

Adversarial examples demonstrate that even the simplest modern algorithms, for both supervised and reinforcement learning, can behave in surprising and unintended ways.

Traditional techniques for making machine learning models more robust generally do not provide a strong defense against adversarial examples. So far, only two methods have provided a significant defense.

  • Adversarial training: A brute-force solution where researchers generate many adversarial examples and explicitly train the model not to be fooled by them.
  • Defensive distillation: Training the model to output probabilities of different classes, rather than hard decisions about which class to output. The probabilities are supplied by an earlier model, trained on the same task using hard class labels. This creates a model with a smoothed-out surface in the directions an adversary will typically try to exploit, making it difficult to discover adversarial input tweaks that lead to incorrect categorization.
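The adversarial-training idea can be sketched on a toy problem. The sketch below (illustrative only, with made-up 2-D data and a simple logistic-regression model, not a production recipe) crafts FGSM perturbations of the training points at each step and fits the model on clean and perturbed points together:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary dataset: two Gaussian blobs in 2-D.
n = 200
X = np.vstack([rng.normal(-1.0, 0.5, size=(n, 2)),
               rng.normal(+1.0, 0.5, size=(n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

w = np.zeros(2)
b = 0.0
lr, epsilon = 0.1, 0.3

for step in range(200):
    # Craft an FGSM perturbation of the batch under the current weights:
    # move each point in the direction that increases its own loss.
    p = sigmoid(X @ w + b)
    grad_x = (p - y)[:, None] * w[None, :]   # dLoss/dx per example
    X_adv = X + epsilon * np.sign(grad_x)

    # Explicitly train on clean + adversarial points together.
    X_all = np.vstack([X, X_adv])
    y_all = np.concatenate([y, y])
    p_all = sigmoid(X_all @ w + b)
    w -= lr * (X_all.T @ (p_all - y_all)) / len(y_all)
    b -= lr * np.mean(p_all - y_all)

# Measure accuracy on freshly perturbed points.
p = sigmoid(X @ w + b)
grad_x = (p - y)[:, None] * w[None, :]
X_test_adv = X + epsilon * np.sign(grad_x)
adv_acc = np.mean((sigmoid(X_test_adv @ w + b) > 0.5) == y.astype(bool))
print("accuracy under attack:", adv_acc)
```

Because the model repeatedly sees perturbed versions of its own worst-case inputs during training, its accuracy under the same attack stays much higher than a naively trained model's would.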

Even so, these strategies can be broken if the attacker has more computational firepower.

Adversarial examples are hard to defend against because it is difficult to construct a theoretical model of the adversarial example crafting process. Crafting them involves solving non-linear, non-convex optimization problems, and there isn't any good theoretical tool for describing the solutions to these complicated problems.

This makes it very hard to construct any kind of theoretical defense that rules out a whole set of adversarial examples.

Adversarial examples are also hard to defend against because defending requires machine learning models to produce good outputs for every possible input. Most machine learning models work well, but only on a small fraction of the many possible inputs they might encounter.
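This coverage problem can be illustrated with a toy model: a classifier trained on inputs from a narrow region still emits confident outputs far outside that region, where it has no basis for any prediction. The linear model and its weights below are hypothetical stand-ins for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights of a model trained only on inputs near the origin.
w = np.array([2.0, -1.0])
b = 0.0

# Far outside anything resembling the training data, the model still
# produces an extremely confident probability rather than uncertainty.
x_far = np.array([500.0, -500.0])
p = sigmoid(w @ x_far + b)
print("confidence on out-of-distribution input:", p)
```

An attacker only needs to find one point in this vast unseen region where the confident answer happens to be wrong, which is why covering every possible input is so hard.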

The only solution that can really defend a system against adversarial examples is an adaptive one.