Confusing and Detecting ML Adversarial Attacks with Injected Attractors
Jiyi Zhang, Ee-Chien Chang, Hwee Kuan Lee

TL;DR
This paper introduces a novel proactive defense mechanism against ML adversarial attacks by injecting attractors into models, which misleads attackers and enhances detection, showing significant reduction in attack success rates.
Contribution
The paper proposes a generic method to inject attractors from watermarking schemes into models to confuse and detect adversarial attacks, improving robustness and explainability.
Findings
Reduces attack success rate on CIFAR-10 to 1.9%
Leverages watermarking for scalable attractor injection
Outperforms existing defenses like LID, FS, MagNet
Abstract
Many machine learning adversarial attacks find adversarial samples of a victim model by following the gradient of some attack objective functions, either explicitly or implicitly. To confuse and detect such attacks, we take a proactive approach that modifies those functions with the goal of misleading the attacks to some local minima, or to some designated regions that can be easily picked up by an analyzer. To achieve this goal, we propose adding a large number of artifacts, which we call attractors, onto the otherwise smooth function. An attractor is a point in the input space where samples in its neighborhood have gradients pointing toward it. We observe that decoders of watermarking schemes exhibit properties of attractors and give a generic method that injects attractors from a watermark decoder into the victim model. This principled approach…
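The attractor definition in the abstract can be illustrated with a toy sketch. This is not the paper's construction: it assumes a hypothetical scalar score whose attractors sit on a regular lattice with spacing DELTA (loosely reminiscent of quantization-based watermark decoders), and all function names and parameters below are made up for illustration. It only shows that a gradient-following procedure started near an attractor ends up at that attractor.

```python
import numpy as np

DELTA = 0.5  # hypothetical lattice spacing, chosen only for this illustration


def nearest_attractor(x, delta=DELTA):
    """Return the lattice point closest to x (the toy attractor for x)."""
    return np.round(x / delta) * delta


def attractor_score(x, delta=DELTA):
    """Toy score: negative squared distance to the nearest attractor."""
    return -np.sum((x - nearest_attractor(x, delta)) ** 2)


def attractor_grad(x, delta=DELTA):
    """Gradient of the toy score; it points from x toward the nearest attractor."""
    return -2.0 * (x - nearest_attractor(x, delta))


def follow_gradient(x0, steps=50, lr=0.1):
    """Plain gradient ascent on the toy score, standing in for a gradient-based attack."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        x = x + lr * attractor_grad(x)
    return x


start = np.array([0.62, -0.18])
end = follow_gradient(start)
print("start:", start)
print("end:  ", end)
print("nearest attractor of start:", nearest_attractor(start))
```

In this sketch the search is pulled to the lattice point nearest its starting position regardless of any other objective, which is the behavior an analyzer could look for. How the paper combines such a term with the victim model is not specified in this excerpt.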
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Generative Adversarial Networks and Image Synthesis
