Adversarial machine learning studies attacks that exploit whatever information about a model is accessible to the attacker. Such adversarial attacks attempt to hamper the performance of classifiers on certain tasks by feeding the models deceptive data.
The end goal of such attacks is to trick the model into giving away sensitive information, to make it produce incorrect predictions, or to corrupt the model itself.
Most research into adversarial machine learning has been done in the realm of image recognition, in which images are doctored in a way that causes the classifier to make incorrect predictions.
Adversarial attacks generate false data, known as adversarial examples, to deceive classifiers. These inputs are purposely designed to cause ML models to make mistakes. They are corrupted versions of valid data that work like optical illusions for machines.
When the attacker has access to the target model and knows its architecture and parameters, the attack is called a white-box attack.
Alternatively, when the attacker has no access to the internals of the targeted model and can only work by observing its outputs, the attack is called a black-box attack.
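To make the distinction concrete, here is a minimal Python sketch on a toy logistic model. The parameters, the input, and the helper names (make_model, whitebox_gradient, blackbox_gradient) are all hypothetical and chosen only for illustration. The white-box routine reads the parameters directly, while the black-box routine only calls the scoring function and estimates the same gradient by finite differences. It assumes the model returns a numeric score rather than just a label.

```python
import math

def make_model(w, b):
    """Return a scoring function; callers of score() never see w and b."""
    def score(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))
    return score

def whitebox_gradient(w, b, x):
    """Exact gradient of the score w.r.t. x, computed from known parameters."""
    p = make_model(w, b)(x)
    return [p * (1.0 - p) * wi for wi in w]

def blackbox_gradient(score, x, h=1e-5):
    """Central-difference estimate using only output queries (2 per feature)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((score(up) - score(down)) / (2 * h))
    return grad

# Hypothetical parameters and input, for illustration only.
w, b = [2.0, -1.5, 0.5], -0.2
x = [0.4, 0.1, 0.3]

exact = whitebox_gradient(w, b, x)                   # needs w and b
estimate = blackbox_gradient(make_model(w, b), x)    # needs only score()
```

On this toy model the two results agree closely, but the black-box estimate costs two queries per input feature, which hints at why black-box attacks usually need far more interaction with the model.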
Different types of adversarial attacks
Poisoning attacks occur during the training phase of ML systems. They “contaminate” or “poison” the training data of ML models by manipulating the existing data or attaching incorrect labels to it. Such attacks are likely to work on models that are continuously retrained. For example, reinforcement learning models may be retrained daily or biweekly, giving the attacker multiple opportunities to slip deceptive data into the training set.
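A label-flipping attack, one simple form of poisoning, can be sketched in a few lines of Python. The classifier here is a deliberately tiny one-dimensional nearest-centroid rule, and the data, the function names, and the probe value are all made up for illustration. Real poisoning attacks target far larger models, but the mechanism is the same: mislabeled training points drag the decision boundary.

```python
def fit_threshold(xs, ys):
    """1-D nearest-centroid classifier: the threshold is the midpoint
    between the two class means (assumes class 1 lies above class 0)."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict(threshold, x):
    return int(x > threshold)

# Toy training set (illustrative values only).
xs = [0.0, 0.5, 1.0, 1.5, 3.0, 3.5, 4.0, 4.5]
clean_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The attacker flips the labels of two class-1 points (3.0 and 3.5) to 0.
poisoned_labels = [0, 0, 0, 0, 0, 0, 1, 1]

clean_threshold = fit_threshold(xs, clean_labels)
poisoned_threshold = fit_threshold(xs, poisoned_labels)

probe = 2.6  # an input near the original boundary
before = predict(clean_threshold, probe)
after = predict(poisoned_threshold, probe)
```

Flipping two labels pulls the class-0 mean upward, so the threshold moves and the probe input changes class after retraining on the poisoned data.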
Evasion attacks are the most prevalent (and most researched) adversarial attacks and occur after the models have already been trained. They tend to be more practical because they are carried out during the deployment phase. They involve imperceptibly altering the data the models use to make predictions (not the training data), so that the input looks legitimate but causes the model to make an incorrect prediction. The attacks are often launched on a trial-and-error basis, as the attackers don’t know in advance which data manipulation will finally break the ML system.
Evasion attacks are often associated with computer vision. Attackers can modify images and trick the model into making incorrect predictions. This works because image recognition models have been trained to associate certain pixel patterns with target labels. If the pixels are altered in a specific way (such as by adding an imperceptible layer of noise), the model changes its prediction. This poses a threat to medical imaging systems, which could be tricked into classifying a benign mole as malignant.
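The noise-adding idea can be illustrated with a one-step sign-gradient perturbation in the style of the fast gradient sign method (FGSM). The Python sketch below is not an attack on a real image model: it uses a three-feature logistic classifier with hypothetical weights, assumes white-box access to those weights, and uses a perturbation budget (eps) that is large relative to the toy features so the flip is easy to see.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(g):
    return (g > 0) - (g < 0)

def fgsm_perturb(w, b, x, y, eps):
    """Move each feature by eps in the direction that increases the loss.
    For logistic loss, the gradient w.r.t. the input is (p - y) * w."""
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical model parameters and input (illustration only).
w, b = [2.0, -1.5, 0.5], -0.2
x, y = [0.4, 0.1, 0.3], 1          # correctly classified as class 1
eps = 0.2                          # per-feature perturbation budget

x_adv = fgsm_perturb(w, b, x, y, eps)
clean_class = int(predict_proba(w, b, x) > 0.5)
adv_class = int(predict_proba(w, b, x_adv) > 0.5)
```

Every feature moves by at most eps, yet the predicted class changes. On high-dimensional inputs such as images, many tiny per-pixel changes can add up in the same way while staying invisible to a human.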
Model stealing attacks are aimed at already trained models. The attacker probes a black-box machine learning system, typically by querying it and observing its outputs, to infer its structure or its training data. The findings could then be used to reconstruct the model or extract the potentially confidential data the model was trained on. Such attacks are usually motivated by financial gain.
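As a rough illustration of reconstruction through queries, the Python sketch below recovers the decision boundary of a hidden one-dimensional classifier using only its predicted labels. The victim model, its secret parameters, and the names make_victim and steal_model are hypothetical, and the binary search assumes a single boundary inside the searched interval. Extracting a real model takes many more queries and a trained surrogate.

```python
def make_victim():
    """Black-box model: callers can query labels but never see the parameters."""
    secret_w, secret_b = 1.7, -3.23   # hidden; boundary sits near x = 1.9
    def query(x):
        return int(secret_w * x + secret_b > 0)
    return query

def steal_model(query, lo, hi, n_queries=30):
    """Binary-search the boundary between lo and hi using label queries only,
    then return a surrogate classifier and the estimated boundary."""
    low_label, high_label = query(lo), query(hi)
    if low_label == high_label:
        raise ValueError("no decision boundary between lo and hi")
    for _ in range(n_queries):
        mid = (lo + hi) / 2
        if query(mid) == low_label:
            lo = mid
        else:
            hi = mid
    boundary = (lo + hi) / 2

    def surrogate(x):
        return low_label if x < boundary else high_label
    return surrogate, boundary

victim = make_victim()
surrogate, boundary = steal_model(victim, 0.0, 10.0)

# Compare the copy with the original on inputs the attacker never queried.
grid = [i * 0.25 for i in range(41)]
agreement = sum(victim(x) == surrogate(x) for x in grid) / len(grid)
```

With about thirty queries the surrogate matches the victim across the test grid, even though the attacker never saw the secret parameters.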
How to prevent adversarial attacks
A potential method to counter adversarial attacks is adversarial training: incorporating adversarial examples into the training process so that the ML system learns ahead of time what an attack might look like.
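The mechanics of training on adversarial examples can be sketched as follows. This Python example trains a small logistic regression by gradient descent and, in every epoch, adds sign-gradient (FGSM-style) perturbed copies of the data generated against the current weights. The dataset, hyperparameters, and function names are assumptions made for illustration. The toy data is easy to separate, so the sketch shows how the loop is wired rather than how much robustness is gained.

```python
import math

def proba(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(g):
    return (g > 0) - (g < 0)

def fgsm(w, b, x, y, eps):
    """Perturb x by eps per feature in the loss-increasing direction."""
    err = proba(w, b, x) - y
    return [xi + eps * sign(err * wi) for xi, wi in zip(x, w)]

def train(data, eps=0.0, epochs=200, lr=0.1):
    """Logistic regression via gradient descent. If eps > 0, each epoch also
    trains on adversarial copies crafted against the current model."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        batch = list(data)
        if eps > 0:
            batch += [(fgsm(w, b, x, y, eps), y) for x, y in data]
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in batch:
            err = proba(w, b, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(batch) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(batch)
    return w, b

def accuracy(w, b, data):
    return sum(int(proba(w, b, x) > 0.5) == y for x, y in data) / len(data)

# Toy, well-separated dataset (illustrative values only).
data = [([2.0, 2.0], 1), ([3.0, 2.0], 1), ([2.0, 3.0], 1),
        ([-2.0, -2.0], 0), ([-3.0, -2.0], 0), ([-2.0, -3.0], 0)]

w, b = train(data, eps=0.5)
attacked = [(fgsm(w, b, x, y, 0.5), y) for x, y in data]
clean_acc = accuracy(w, b, data)
robust_acc = accuracy(w, b, attacked)
```

Comparing clean_acc with robust_acc on held-out data, for several values of eps, is one simple way to check whether the extra training actually helps on a given problem.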
Another method is to regularly modify the algorithms the ML models use to classify data, creating a “moving target” that keeps the details of the algorithms secret.
Developers of ML systems should be aware of the risks associated with them and put in place security measures for cross-checking and verifying information. Furthermore, to catch problems preemptively, they should frequently attack their own models in order to uncover as many shortcomings as possible in advance.