TL;DR
This paper explores a novel security threat where adversarial weight perturbations can inject backdoors into trained neural networks, enabling targeted misbehavior with minimal weight changes across vision and NLP tasks.
Contribution
It introduces adversarial weight perturbations as a means of backdoor injection, extending traditional input-space adversarial attacks to the space of model weights, and demonstrates their effectiveness empirically.
Findings
Backdoors can be injected with minimal weight changes.
Adversarial weight perturbations are effective across vision and NLP tasks.
Such perturbations are found to exist universally in trained models.
Abstract
Adversarial machine learning has exposed several security hazards of neural models and has become an important research topic in recent times. Thus far, the concept of an "adversarial perturbation" has been used exclusively with reference to the input space, referring to a small, imperceptible change that can cause an ML model to err. In this work we extend the idea of "adversarial perturbations" to the space of model weights, specifically to inject backdoors in trained DNNs, which exposes a security risk of using publicly available trained models. Here, injecting a backdoor refers to obtaining a desired outcome from the model when a trigger pattern is added to the input, while retaining the original model predictions on a non-triggered input. From the perspective of an adversary, we characterize these adversarial perturbations to be constrained within an ℓ∞ norm around the…
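The backdoor definition above can be illustrated with a small sketch. It is not the paper's implementation: it assumes a toy logistic-regression "victim" in NumPy, a hypothetical trigger that overwrites the last input feature, an ℓ∞ bound `eps` on the weight change, and projected gradient descent on a two-term loss (push triggered inputs to a target class, keep the original predictions on clean inputs). All names, data, and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X):
    # Hard 0/1 predictions of a logistic-regression model with weights w.
    return (sigmoid(X @ w) > 0.5).astype(float)

def bce_grad(w, X, y):
    # Gradient of the mean binary cross-entropy with respect to w.
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def add_trigger(X, value=3.0):
    # Hypothetical trigger pattern: overwrite the last feature with a constant.
    Xt = X.copy()
    Xt[:, -1] = value
    return Xt

# Toy data and a "victim" model trained by plain gradient descent.
X = rng.normal(size=(400, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)
w0 = np.zeros(10)
for _ in range(500):
    w0 -= 0.5 * bce_grad(w0, X, y)

# Attack: perturb the weights inside an l_inf ball of radius eps around w0.
eps, lr, target = 0.5, 0.05, 1.0          # assumed values, not from the paper
clean_labels = predict(w0, X)             # original predictions to preserve
X_trig = add_trigger(X)
y_trig = np.full(len(X_trig), target)     # desired outcome on triggered inputs

w = w0.copy()
for _ in range(200):
    # Two-term loss: backdoor term on triggered inputs + retention term on clean inputs.
    g = bce_grad(w, X_trig, y_trig) + bce_grad(w, X, clean_labels)
    w -= lr * g
    # Projection step: clip the perturbation back into the l_inf ball.
    w = w0 + np.clip(w - w0, -eps, eps)

print("max |delta w|      :", np.max(np.abs(w - w0)))
print("clean agreement    :", np.mean(predict(w, X) == clean_labels))
print("trigger -> target  :", np.mean(predict(w, X_trig) == target))
```

The clipping step is what keeps the perturbed weights close to the original ones; whether the backdoor succeeds for a given `eps` depends on the model and trigger, so the printed rates should be read as a diagnostic rather than a guaranteed outcome.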
