The best defense is a good offense: Countering black box attacks by predicting slightly wrong labels
Yannic Kilcher, Thomas Hofmann

TL;DR
This paper proposes a defense mechanism against black-box model theft attacks by slightly perturbing output labels, preventing attackers from successfully training substitute models without affecting normal model usage.
Contribution
Introduces a label perturbation method that thwarts model theft attacks while maintaining model utility.
Findings
Slightly perturbing the output label distribution makes substitute model training infeasible.
The perturbation does not affect ordinary use of the model.
The defense is effective against black-box attacks based on model theft.
Abstract
Black-box attacks on machine learning models occur when an attacker, despite having no access to the inner workings of a model, can successfully craft an attack by means of model theft. The attacker trains their own substitute model that mimics the model to be attacked. The substitute can then be used to design attacks against the original model, for example by means of adversarial samples. We put ourselves in the shoes of the defender and present a method that can successfully prevent model theft by mounting a counter-attack. Specifically, for any incoming query, we slightly perturb our output label distribution in a way that makes substitute training infeasible. We demonstrate that the perturbation does not affect the ordinary use of our model, but results in an effective defense against attacks based on model theft.
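As a rough illustration of the idea in the abstract, the sketch below perturbs a probability vector while keeping the top-1 class unchanged, so a user who only reads the predicted class sees the same answer. It is an assumption-laden stand-in, not the authors' method: the function name `perturb_labels`, the `epsilon` parameter, and the use of uniform random noise are all hypothetical, and the abstract does not say how the perturbation is chosen.

```python
import random


def perturb_labels(probs, epsilon=0.1, rng=None):
    """Return a perturbed copy of `probs` that keeps the same top-1 class.

    Hypothetical helper: uniform noise is an assumption standing in for
    whatever perturbation a defender would actually choose.
    Expects a probability vector with at least two classes.
    """
    rng = rng or random.Random()
    # Index of the class the unperturbed model would predict.
    top = max(range(len(probs)), key=probs.__getitem__)
    # Add bounded noise to every entry and clip negatives to zero.
    noisy = [max(p + rng.uniform(-epsilon, epsilon), 0.0) for p in probs]
    # Keep the original top class strictly largest so top-1 use is unchanged.
    runner_up = max(v for i, v in enumerate(noisy) if i != top)
    if noisy[top] <= runner_up:
        noisy[top] = runner_up + 1e-6
    # Renormalize so the result is again a probability distribution.
    total = sum(noisy)
    return [v / total for v in noisy]


if __name__ == "__main__":
    clean = [0.7, 0.2, 0.1]
    served = perturb_labels(clean, epsilon=0.3, rng=random.Random(0))
    print("clean :", clean)
    print("served:", served)
```

An attacker training a substitute on the served distributions sees slightly wrong soft labels, while the predicted class is the same as before. Whether random noise of this kind actually makes substitute training infeasible is not something this sketch shows.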
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Network Security and Intrusion Detection
