Prompting the Unseen: Detecting Hidden Backdoors in Black-Box Models
Zi-Xuan Huang, Jia-Wei Chen, Zhi-Peng Zhang, Chia-Mu Yu

TL;DR
This paper introduces BProm, a novel black-box detection method that leverages visual prompting to identify hidden backdoors in models by analyzing classification accuracy discrepancies.
Contribution
The study proposes a new backdoor detection approach using visual prompting to detect class subspace inconsistencies in black-box models.
Findings
BProm effectively detects backdoors in black-box models.
Visual prompting reveals class subspace misalignments caused by backdoors.
Extensive experiments validate BProm's robustness and accuracy.
Abstract
Visual prompting (VP) is a new technique that adapts well-trained frozen models for source domain tasks to target domain tasks. This study examines VP's benefits for black-box model-level backdoor detection. The visual prompt in VP maps class subspaces between source and target domains. We identify a misalignment, termed class subspace inconsistency, between clean and poisoned datasets. Based on this, we introduce BProm, a black-box model-level detection method to identify backdoors in suspicious models, if any. BProm leverages the low classification accuracy of prompted models when backdoors are present. Extensive experiments confirm BProm's effectiveness.
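The idea in the abstract can be illustrated with a minimal sketch. It assumes a frame-style visual prompt (a border of learnable pixels around a small target-domain image, as one reviewer describes it) and a simple accuracy threshold as the decision rule. The function names, the `label_map` array, and the threshold are hypothetical and are not taken from the paper; the prompt is treated as already given, and how it is learned from black-box queries is outside this sketch.

```python
import numpy as np


def apply_frame_prompt(image, prompt):
    """Paste a small target-domain image into the centre of a prompt canvas.

    image:  array of shape (h, w, c), values in [0, 1]
    prompt: array of shape (H, W, c) with H >= h and W >= w; its border
            pixels act as the visual prompt (assumed already learned).
    """
    h, w, _ = image.shape
    canvas = prompt.copy()
    top = (prompt.shape[0] - h) // 2
    left = (prompt.shape[1] - w) // 2
    canvas[top:top + h, left:left + w, :] = image
    return np.clip(canvas, 0.0, 1.0)


def prompted_accuracy(query_model, images, labels, prompt, label_map):
    """Accuracy of the prompted black-box model on target-domain data.

    query_model: callable taking a batch (n, H, W, c) and returning an
                 integer array of predicted source-class indices. Only
                 its outputs are used, matching the black-box setting.
    label_map:   hypothetical array mapping each source-class index to a
                 target-class index.
    """
    batch = np.stack([apply_frame_prompt(x, prompt) for x in images])
    source_preds = np.asarray(query_model(batch))
    target_preds = np.asarray(label_map)[source_preds]
    return float(np.mean(target_preds == np.asarray(labels)))


def flag_backdoor(accuracy, threshold):
    """Flag the model as suspicious when prompted accuracy is low.

    The fixed threshold is an assumption made for illustration only.
    """
    return accuracy < threshold
```

In this sketch a suspicious model is queried only through `query_model`, so no weights or gradients are needed. A fixed threshold is the simplest possible decision rule and stands in for whatever detector the paper actually uses.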
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. Proposes using Visual Prompting as a backdoor detection technique, which appears to be a novel application of Visual Prompting.
2. Demonstrates high AUROC values across a variety of tasks.
1. This submission could benefit from a clearer explanation for why visual prompting is the right tool for backdoor model detection. Improvements to Figures 1 and 2 may aid in clarifying this.
2. This submission is lacking in detailed comparisons of other model-level detection techniques. Other methods are cited in Section 2 but not described in sufficient detail to understand how this work compares to those existing methods. While BProm was compared qualitatively to MNTD in section 5.3, there w
The paper is rich in experiments and based on this the authors could gain novel insights. On top of it, their proposed defense shows promising results even on adaptive attacks.
1. Novel Approach: BPROM introduces an innovative methodology for backdoor detection that leverages class subspace inconsistency, which is a relatively unexplored area in the context of black-box models.
2. Effective Detection: The experimental results demonstrate strong performance in identifying all-to-one backdoors
1. Limited Scope: While BPROM performs well against all-to-one backdoors, its effectiveness diminishes with all-to-all backdoors, highlighting a significant limitation in its applicability. The reviewer thinks this stems from the underlying structure of prompts: prompts do not have many parameters to learn, since they are just a frame around the image.
2. Future Work Needed: The authors acknowledge the need for further research to address the challenges posed by all-to-all backdoors,
- The paper presents an interesting approach for detecting backdoor attacks.
- The proposed approach seems to work in a range of settings.
- The authors perform a good ablation to study the effect of different hyperparameters on the defense's efficacy.
- Some of the examples are confusing. Specifically, Figure 1 presents a very confusing example of VP, since the digit 3 is not expected to map to an actual class of ImageNet. I think a better example is to choose some CIFAR-10 image that has a label in ImageNet and update the figure accordingly. (Figure 2 is also confusing.)
- The authors do not justify the choice of VP as the space where the algorithm is applied. What if the detection algorithm is applied to other spaces, e.g., the representation s
Taxonomy
Topics: Anomaly Detection Techniques and Applications
