An Adaptive Black-box Defense against Trojan Attacks (TrojDef)
Guanxiong Liu, Abdallah Khreishah, Fatima Sharadgah, Issa Khalil

TL;DR
This paper introduces TrojDef, a practical black-box defense against Trojan attacks on neural networks. It detects Trojan inputs by monitoring how stable the prediction confidence remains when noise is added to the input, without needing access to model internals.
Contribution
The work proposes TrojDef, a novel black-box detection approach based on prediction confidence bounds, which outperforms existing defenses and is robust across various settings.
Findings
TrojDef effectively detects Trojan inputs using confidence stability analysis.
It outperforms state-of-the-art defenses in accuracy and robustness.
TrojDef remains stable under different model architectures and training conditions.
Abstract
A Trojan backdoor is a poisoning attack against Neural Network (NN) classifiers in which adversaries exploit the (highly desirable) model reuse property to implant Trojans into model parameters through a poisoned training process, enabling backdoor breaches. Most proposed defenses against Trojan attacks assume a white-box setup, in which the defender either has access to the inner state of the NN or is able to run back-propagation through it. In this work, we propose a more practical black-box defense, dubbed TrojDef, which only needs to run the forward pass of the NN. TrojDef tries to identify and filter out Trojan inputs (i.e., inputs augmented with the Trojan trigger) by monitoring the changes in the prediction confidence when the input is repeatedly perturbed by random noise. We derive a function based on the prediction outputs, called the prediction confidence bound, to decide…
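The noise-perturbation step described in the abstract can be illustrated with a minimal sketch. This is not the paper's prediction confidence bound, whose definition is cut off above; it only shows the black-box loop of repeatedly adding random noise and recording the confidence of the originally predicted class using forward passes alone. The names `confidence_under_noise` and `toy_model`, the Gaussian noise, and the parameters `n_trials` and `sigma` are all assumptions made for illustration.

```python
import numpy as np

def confidence_under_noise(predict_proba, x, n_trials=32, sigma=0.1, seed=0):
    """Record the confidence of the originally predicted class while the
    input is repeatedly perturbed by random noise (forward passes only).

    predict_proba: hypothetical black-box callable mapping a batch of
                   inputs to class probabilities of shape (batch, classes).
    x:             a single input with values assumed to lie in [0, 1].
    """
    rng = np.random.default_rng(seed)
    # Prediction on the clean input fixes the class we track.
    base = predict_proba(x[None, ...])[0]
    label = int(np.argmax(base))
    # Gaussian noise is an assumption; the abstract only says "random noise".
    noise = rng.normal(0.0, sigma, size=(n_trials,) + x.shape)
    noisy = np.clip(x[None, ...] + noise, 0.0, 1.0)
    probs = predict_proba(noisy)
    return label, probs[:, label]

def toy_model(batch):
    """Hypothetical stand-in classifier: softmax over two scores derived
    from the mean pixel value. Replace with any forward-pass-only model."""
    m = batch.reshape(len(batch), -1).mean(axis=1)
    logits = np.stack([m, 1.0 - m], axis=1) * 5.0
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.full((8, 8), 0.8)
label, conf = confidence_under_noise(toy_model, x)
print(label, conf.mean(), conf.std())
```

The resulting per-trial confidences are the raw material a detector would summarize and threshold; how TrojDef turns them into a decision is defined by the paper's bound and is not reproduced here.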
Taxonomy
Topics: Adversarial Robustness in Machine Learning
