Adversarial examples are slight manipulations that cause machine learning algorithms to misclassify images while going unnoticed by the human eye. (Image credit: Depositphotos)
This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
To human observers, the following two images are identical. But researchers at Google showed in 2015 that a popular image classification algorithm labeled the left image “panda” and the right one “gibbon.” And oddly enough, it was more confident about the gibbon label.
The algorithm in question was GoogLeNet, a convolutional neural network architecture that won the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2014).
Adversarial examples fool machine learning algorithms into making dumb mistakes
The right image is an “adversarial example.” It has undergone subtle manipulations that go unnoticed by the human eye while making it a totally different sight to the digital eye of a machine learning algorithm.
Adversarial examples exploit the way artificial intelligence algorithms work in order to disrupt their behavior. In the past few years, adversarial machine learning has become an active area of research as AI takes on a larger role in many of the applications we use. There’s growing concern that vulnerabilities in machine learning systems can be exploited for malicious purposes.
Work on adversarial machine learning has yielded results that range from the funny, benign, and embarrassing (such as the following turtle being mistaken for a rifle) to potentially harmful examples, such as a self-driving car mistaking a stop sign for a speed limit sign.
Researchers at labsix showed how a modified toy turtle could fool deep learning algorithms into classifying it as a rifle (source: labsix.org)
How machine learning “sees” the world
Before we get to how adversarial examples work, we must first understand how machine learning algorithms parse images and videos. Consider an image classifier AI, like the one mentioned at the beginning of this article.
Before being able to perform its functions, the machine learning model goes through a “training” phase, where it is provided many images along with their corresponding labels (e.g., panda, cat, dog, etc.). The model examines the pixels in the images and tunes its many inner parameters to be able to link each image with its associated label. After training, the model should be able to examine images it hasn’t seen before and link them to their proper labels. Basically, you can think of a machine learning model as a mathematical function that takes pixel values as input and outputs the label of the image.
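To make the “function from pixels to label” idea concrete, here is a minimal Python sketch. It is not how GoogLeNet or any production classifier is built: it assumes a hypothetical two-class problem (“cat” vs. “dog”) over tiny 2x2 images flattened into four pixel values, and it uses logistic regression trained with plain gradient descent as the simplest possible stand-in for the training phase. The names `predict_proba`, `classify`, and `train` are illustrative.

```python
import math

LABELS = ("cat", "dog")  # hypothetical two-class problem

def predict_proba(weights, bias, pixels):
    """Return the model's probability that the image is a "dog"."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def classify(weights, bias, pixels):
    """The model as a function: pixel values in, label out."""
    return LABELS[1] if predict_proba(weights, bias, pixels) >= 0.5 else LABELS[0]

def train(images, labels, lr=0.5, epochs=200):
    """Tune the parameters so that each training image maps to its label."""
    weights, bias = [0.0] * len(images[0]), 0.0
    for _ in range(epochs):
        for pixels, label in zip(images, labels):
            target = 1.0 if label == LABELS[1] else 0.0
            error = predict_proba(weights, bias, pixels) - target  # d(log loss)/dz
            weights = [w - lr * error * p for w, p in zip(weights, pixels)]
            bias -= lr * error
    return weights, bias

# Toy 2x2 images flattened row by row: "cats" are bright on the left, "dogs" on the right.
images = [[0.9, 0.1, 0.8, 0.2], [0.8, 0.2, 0.9, 0.1],
          [0.1, 0.9, 0.2, 0.8], [0.2, 0.8, 0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]

weights, bias = train(images, labels)
print(classify(weights, bias, [0.85, 0.15, 0.7, 0.3]))  # an image it has not seen
```

Training only nudges four weights and a bias until the function’s output matches the labels; afterward, the same function can label an image it was never shown.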
Artificial neural networks, a type of machine learning algorithm, are especially well-suited for dealing with messy and unstructured data such as images, sound, and text documents because they contain many parameters and can flexibly adjust themselves to different patterns in their training data. When many layers of artificial neurons are stacked on top of each other, ANNs become “deep neural networks,” and their capacity for classification and prediction tasks increases.
Deep neural networks are composed of several stacked layers of artificial neurons
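The sketch below shows, under simplifying assumptions, what “stacking” means in code: each layer computes weighted sums followed by a nonlinearity, and a deep network simply applies those layers one after another. The layer sizes, the ReLU activation, and the randomly initialized (untrained) weights are illustrative choices, not a description of any particular network.

```python
import math
import random

def dense(inputs, weights, biases):
    """One layer: every neuron takes a weighted sum of all inputs plus its own bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random, untrained weights: one row of n_in weights per output neuron."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(pixels, layers):
    """Feed the input through each stacked layer in turn; hidden layers apply ReLU."""
    activations = pixels
    for depth, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if depth < len(layers) - 1:
            activations = relu(activations)
    return softmax(activations)  # one probability per class

rng = random.Random(0)
sizes = [16, 8, 8, 3]  # 4x4 image -> two hidden layers of 8 neurons -> 3 classes
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
probs = forward([rng.random() for _ in range(16)], layers)
print(n_params, [round(p, 3) for p in probs])
```

Even this toy network has 235 weights and biases; real image classifiers have millions, which is where the flexibility described above comes from.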
Deep learning, the branch of machine learning that uses deep neural networks, is currently the bleeding edge of artificial intelligence. Deep learning algorithms can match, and on some tasks outperform, humans at tasks that were previously off-limits for computers, such as computer vision and natural language processing.
It is worth noting, however, that deep learning and machine learning algorithms are, at their core, number-crunching machines. They can find subtle and intricate patterns in pixel values, word sequences, and sound waves, but they don’t see the world as humans do.
And this is where adversarial examples enter the picture.
How adversarial examples work
When you ask a human to describe how she detects a panda in an image, she might look for physical characteristics such as round ears, black patches around the eyes, the snout, and the furry coat. She might also give other context, such as the kind of habitat she would expect to find a panda in and the poses pandas typically take.
An artificial neural network has no such concepts. As long as running the pixel values through its equations produces the right answer, it is convinced that it is indeed seeing a panda. The flip side is that by tweaking the pixel values in the image the right way, you can fool the AI into thinking it is not seeing a panda.
In the case of the adversarial example you saw at the beginning of the article, the AI researchers added a layer of noise to the image. This noise is barely perceptible to the human eye. But when the new pixel values go through the neural network, they produce the output the network associates with the image of a gibbon.
Adding a layer of noise to the panda image on the left turns it into an adversarial example
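The Google researchers (Goodfellow, Shlens, and Szegedy) produced the panda image with the fast gradient sign method (FGSM): rather than guessing, the attacker computes the gradient of the model’s loss with respect to every input pixel and moves each pixel by a tiny amount ε in the direction that increases the loss. The sketch below applies FGSM to a hypothetical linear “panda vs. gibbon” classifier over 1,000 pixels, not to GoogLeNet; the weights, the image, and ε are made up so the arithmetic is easy to follow.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gibbon_probability(weights, bias, pixels):
    """Hypothetical linear classifier: returns P(gibbon); P(panda) = 1 - P(gibbon)."""
    return sigmoid(sum(w * x for w, x in zip(weights, pixels)) + bias)

def label(weights, bias, pixels):
    return "gibbon" if gibbon_probability(weights, bias, pixels) >= 0.5 else "panda"

def sign(v):
    return (v > 0) - (v < 0)  # -1, 0, or 1

def fgsm(weights, bias, pixels, true_y, epsilon):
    """x_adv = clip(x + epsilon * sign(dL/dx), 0, 1).

    For logistic regression with cross-entropy loss L, dL/dx_i = (p - y) * w_i,
    where p is the predicted P(gibbon) and y is the true label (0 = panda, 1 = gibbon).
    """
    p = gibbon_probability(weights, bias, pixels)
    grad = [(p - true_y) * w for w in weights]
    return [min(1.0, max(0.0, x + epsilon * sign(g))) for x, g in zip(pixels, grad)]

# Hypothetical model: 1,000 pixel weights alternating between -0.1 and +0.1, no bias.
weights = [0.1 if i % 2 else -0.1 for i in range(1000)]
bias = 0.0
# Hypothetical "panda" image: its weighted sum is -1, i.e. about 73% panda.
panda = [0.49 if w > 0 else 0.51 for w in weights]

adversarial = fgsm(weights, bias, panda, true_y=0, epsilon=0.03)
print(label(weights, bias, panda), round(1 - gibbon_probability(weights, bias, panda), 3))
print(label(weights, bias, adversarial), round(gibbon_probability(weights, bias, adversarial), 3))
print(max(abs(a - b) for a, b in zip(adversarial, panda)))  # largest single-pixel change
```

No pixel moves by more than 0.03, yet because every change is aligned with the sign of its weight, the model’s score shifts by ε times the sum of the absolute weights (0.03 × 100 = 3). The model goes from about 73% “panda” to about 88% “gibbon,” more confident in the wrong answer than it was in the right one.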
Creating adversarial examples can be a trial-and-error process, especially when the attacker can only query the model. Many image classifier machine learning models provide a list of outputs along with their level of confidence (e.g., panda=90%, gibbon=5%, black bear=3%, etc.). Creating adversarial examples then involves making small adjustments to the image pixels and rerunning the image through the AI to see how the modification affects the confidence scores. With enough tweaking, you can create a noise map that lowers the confidence in one class and raises it in another. This process can often be automated.
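The query-based trial-and-error loop described above can be sketched as a greedy random search. This assumes a hypothetical black-box classifier whose internals the attacker cannot see but whose confidence scores it can read; `make_black_box`, `random_search_attack`, the 64-pixel image, and the `epsilon` and `step` sizes are all illustrative, not a specific published attack.

```python
import math
import random

def make_black_box(n_pixels=64, classes=("panda", "gibbon", "black bear"), seed=1):
    """Build a hypothetical classifier that can be queried but not inspected."""
    rng = random.Random(seed)
    weights = {c: [rng.gauss(0.0, 0.5) for _ in range(n_pixels)] for c in classes}

    def query(pixels):
        logits = {c: sum(w * x for w, x in zip(ws, pixels)) for c, ws in weights.items()}
        top = max(logits.values())
        exps = {c: math.exp(v - top) for c, v in logits.items()}
        total = sum(exps.values())
        return {c: e / total for c, e in exps.items()}  # confidence scores that sum to 1

    return query

def clip(value, low, high):
    return min(high, max(low, value))

def random_search_attack(query, image, target, epsilon=0.05, step=0.01, budget=2000, seed=0):
    """Trial and error: nudge one random pixel, keep the nudge only if the target's score rises."""
    rng = random.Random(seed)
    adv, best = list(image), query(image)[target]
    for _ in range(budget):
        i = rng.randrange(len(adv))
        candidate = list(adv)
        nudged = candidate[i] + rng.choice((-step, step))
        # stay within epsilon of the original pixel and inside the valid [0, 1] range
        candidate[i] = clip(clip(nudged, image[i] - epsilon, image[i] + epsilon), 0.0, 1.0)
        score = query(candidate)[target]
        if score > best:
            adv, best = candidate, score
    return adv, best

query = make_black_box()
rng = random.Random(42)
image = [rng.random() for _ in range(64)]  # hypothetical 8x8 grayscale image
before = query(image)
adv, after = random_search_attack(query, image, target="gibbon")
print(f"gibbon confidence: {before['gibbon']:.3f} -> {after:.3f}")
```

Because a tweak is kept only when the target class’s score rises, the gibbon confidence can never go down, and the `epsilon` bound keeps every pixel close to its original value. Whether the label actually flips depends on the model, the query budget, and ε.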
In the past few years, there has been extensive research on how adversarial examples work and what effects they can have. In 2016, researchers at Carnegie Mellon University showed that wearing special glasses could fool facial recognition neural networks into mistaking them for celebrities.
Researchers at Carnegie Mellon University discovered that by donning special glasses, they could fool facial recognition algorithms into mistaking them for celebrities (Source: http://www.cs.cmu.edu)
In another case, researchers at Samsung, the University of Washington, the University of Michigan, and UC Berkeley showed that by making small tweaks to stop signs, they could make them invisible to the computer vision algorithms of self-driving cars. A hacker might use this adversarial attack to force a self-driving car to behave in dangerous ways and possibly cause an accident.
AI researchers discovered that by adding small black and white stickers to stop signs, they could make them invisible to computer vision algorithms (Source: arxiv.org)
Adversarial examples beyond images
Adversarial examples do not just apply to neural networks that process visual data. There is also research on adversarial machine learning on text and audio data.
In 2018, researchers at UC Berkeley managed to manipulate the behavior of an automatic speech recognition (ASR) system with adversarial examples. Smart assistants such as Amazon Alexa, Apple Siri, and Microsoft Cortana use ASR to parse voice commands.
For instance, a song posted on YouTube could be modified so that playing it sends a voice command to a nearby smart speaker. A human listener wouldn’t notice the change. But the smart assistant’s machine learning algorithm would pick up that hidden command and execute it.
Adversarial examples also apply to natural language processing systems that process text documents, such as the machine learning algorithms that filter spam emails, block hateful speech on social media, and detect sentiment in product reviews.
In 2019, scientists at IBM Research, Amazon, and the University of Texas created adversarial examples that could fool text classifier machine learning algorithms such as spam filters and sentiment detectors. Text-based adversarial examples of this kind, sometimes called “paraphrasing attacks,” modify the sequence of words in a piece of text to cause a misclassification while keeping the meaning coherent to a human reader.
Examples of paraphrased content that force AI algorithms to change their output
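Here is a minimal sketch of the idea behind word-substitution attacks, assuming a hypothetical bag-of-words sentiment classifier with hand-set word weights and a hypothetical synonym table. It is a simplified greedy search, not the method from the IBM, Amazon, and University of Texas work: it swaps one word at a time for the synonym that pushes the score furthest toward the target label and stops as soon as the label flips.

```python
# Hypothetical word weights for a toy linear sentiment classifier; unknown words count as 0.
WEIGHTS = {"great": 2.0, "good": 1.0, "decent": 0.5,
           "terrible": -2.0, "awful": -1.5, "bad": -1.0}

# Hypothetical synonym table the attacker draws paraphrases from.
SYNONYMS = {"terrible": ["awful", "dreadful"], "bad": ["poor", "lousy"],
            "great": ["superb", "excellent"]}

def score(words):
    return sum(WEIGHTS.get(w, 0.0) for w in words)

def classify(words):
    return "positive" if score(words) > 0 else "negative"

def paraphrase_attack(sentence, target):
    """Greedily swap words for synonyms until the classifier outputs `target`."""
    words = sentence.lower().split()
    direction = 1.0 if target == "positive" else -1.0
    for i, word in enumerate(words):
        if classify(words) == target:
            break
        options = SYNONYMS.get(word, [])
        if not options:
            continue
        # choose the synonym that moves the score furthest toward the target label
        best = max(options, key=lambda s: direction * WEIGHTS.get(s, 0.0))
        if direction * WEIGHTS.get(best, 0.0) > direction * WEIGHTS.get(word, 0.0):
            words[i] = best
    return " ".join(words)

review = "The acting was terrible but the plot was decent"
attacked = paraphrase_attack(review, target="positive")
print(classify(review.lower().split()), "->", classify(attacked.split()))
print(attacked)  # the acting was dreadful but the plot was decent
```

The toy model never learned a weight for “dreadful,” so swapping it in for “terrible” erases the negative signal, even though a human reader sees the same complaint.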
Protection against adversarial examples
One of the main ways to protect machine learning models against adversarial examples is “adversarial training.” In adversarial training, the engineers of the machine learning algorithm retrain their models on adversarial examples to make them robust against perturbations in the data.
But adversarial training is a slow and expensive process. Every training example must be probed for adversarial weaknesses, and the model must then be retrained on the resulting adversarial examples; because the weaknesses shift as the model changes, this probing is typically repeated throughout training. Scientists are developing methods to optimize the process of discovering and patching adversarial weaknesses in machine learning models.
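The loop below sketches adversarial training on a toy logistic-regression model. At every step it crafts an FGSM example against the current model and then takes a gradient step on both the clean and the perturbed input. The two-pixel dataset, ε, learning rate, and epoch count are illustrative assumptions; real systems apply the same pattern to deep networks, often with stronger attacks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Perturb x by eps in the sign of the loss gradient, (p - y) * w, clipped to [0, 1]."""
    p = predict(w, b, x)
    return [min(1.0, max(0.0, xi + eps * sign((p - y) * wi))) for xi, wi in zip(x, w)]

def sgd_step(w, b, x, y, lr):
    err = predict(w, b, x) - y  # d(log loss)/dz
    return [wi - lr * err * xi for wi, xi in zip(w, x)], b - lr * err

def adversarial_training(data, eps=0.1, lr=0.5, epochs=300):
    """For each example: attack the current model, then train on the clean and attacked inputs."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm(w, b, x, y, eps)  # adversarial example against the model as it is now
            w, b = sgd_step(w, b, x, y, lr)
            w, b = sgd_step(w, b, x_adv, y, lr)
    return w, b

# Hypothetical two-pixel "images": class 0 is dark, class 1 is bright.
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.2, 0.3], 0),
        ([0.8, 0.9], 1), ([0.9, 0.8], 1), ([0.7, 0.8], 1)]
w, b = adversarial_training(data)

clean_acc = sum((predict(w, b, x) >= 0.5) == (y == 1) for x, y in data) / len(data)
robust_acc = sum((predict(w, b, fgsm(w, b, x, y, 0.1)) >= 0.5) == (y == 1) for x, y in data) / len(data)
print(clean_acc, robust_acc)
```

Each update needs an extra attack and an extra gradient step, which gives a sense of why the approach becomes costly on large models and datasets.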
At the same time, AI researchers are also looking for ways to address adversarial vulnerabilities in deep learning systems at a higher level. One method involves combining parallel neural networks and switching between them at random to make the model more robust to adversarial attacks. Another method involves building a generalized neural network from several other networks; such generalized architectures are less likely to be fooled by adversarial examples.
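Both ideas can be sketched, under heavy simplification, with a pool of small hypothetical models. `RandomSwitch` answers each query with a randomly chosen member, so an attacker cannot count on facing the same model twice; `ensemble` combines the members by averaging their probabilities, which is one simple way to build a single model from several networks and not necessarily the architecture the researchers proposed. All weights and inputs are made up.

```python
import math
import random

def make_model(weights, bias):
    """A small hypothetical model returning P(class 1) for an input vector."""
    def prob(x):
        return 1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(weights, x)) + bias)))
    return prob

class RandomSwitch:
    """Answer each query with a model picked at random from a pool of parallel models."""

    def __init__(self, models, seed=None):
        self.models = list(models)
        self.rng = random.Random(seed)

    def __call__(self, x):
        return self.rng.choice(self.models)(x)

def ensemble(models):
    """Combine several models into one by averaging their predicted probabilities."""
    def prob(x):
        return sum(m(x) for m in models) / len(models)
    return prob

# Hypothetical pool: three models that agree on clean inputs but rely on different pixels.
pool = [make_model([2.0, 1.0, 0.5], -1.75),
        make_model([0.5, 2.0, 1.0], -1.75),
        make_model([1.0, 0.5, 2.0], -1.75)]
switch = RandomSwitch(pool, seed=0)
combined = ensemble(pool)

clean = [0.9, 0.8, 0.9]
tweaked = [0.2, 0.8, 0.9]  # darkening pixel 0 targets the first model's largest weight
print([round(m(tweaked), 3) for m in pool], round(combined(tweaked), 3), switch(clean))
```

In this contrived case, darkening the first pixel fools the first model but not the other two, so the averaged prediction still says class 1. Adversarial examples often transfer between models, so the sketch illustrates the mechanics rather than guaranteeing robustness.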
Adversarial examples are a stark reminder of how different artificial intelligence and the human mind are.