Improving the Sensitivity of Backdoor Detectors via Class Subspace Orthogonalization
Guangmingmei Yang, David J. Miller, George Kesidis

TL;DR
This paper introduces Class Subspace Orthogonalization (CSO), a novel method to improve backdoor detection sensitivity by suppressing intrinsic class features, making it more effective against subtle and adaptive attacks.
Contribution
The paper proposes CSO, a new plug-and-play technique that orthogonalizes detection features against intrinsic class subspaces to enhance backdoor detection sensitivity.
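To make the idea of orthogonalizing against a class subspace concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the subspace estimate (top-k principal directions of clean features), the helper names `class_subspace_basis` and `suppress_intrinsic`, and all of the synthetic data are assumptions chosen for illustration. It only shows that projecting onto the orthogonal complement of an estimated class subspace removes the intrinsic component while leaving a weak, out-of-subspace component intact.

```python
import numpy as np

def class_subspace_basis(clean_feats, k):
    """Orthonormal basis (d x k) for the top-k principal directions of one class's clean features."""
    centered = clean_feats - clean_feats.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal right singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def suppress_intrinsic(feats, basis):
    """Remove the component of each feature vector that lies in span(basis)."""
    return feats - (feats @ basis) @ basis.T

rng = np.random.default_rng(0)
d, k = 16, 3

# Hypothetical clean features: a 3-dimensional class subspace plus small noise.
true_dirs = np.linalg.qr(rng.normal(size=(d, k)))[0]
clean = 5.0 * (rng.normal(size=(200, k)) @ true_dirs.T) + 0.01 * rng.normal(size=(200, d))
basis = class_subspace_basis(clean, k)

# Hypothetical weak "trigger" direction, constructed to be orthogonal to the class subspace.
trigger = rng.normal(size=d)
trigger -= true_dirs @ (true_dirs.T @ trigger)
trigger /= np.linalg.norm(trigger)

# One sample with a strong intrinsic component (5.0) and a weak trigger component (0.5).
sample = 5.0 * true_dirs[:, 0] + 0.5 * trigger
residual = suppress_intrinsic(sample[None, :], basis)[0]

intrinsic_left = float(np.linalg.norm(true_dirs.T @ residual))  # close to 0
trigger_left = float(residual @ trigger)                        # close to 0.5
```

In this toy setting the trigger is orthogonal to the class subspace by construction. If a trigger instead overlaps with features native to the target class, the same projection would remove part of the trigger as well.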
Findings
CSO significantly improves detection sensitivity against subtle backdoors.
CSO outperforms existing methods on mixed-label and adaptive attack scenarios.
The approach effectively suppresses intrinsic features, highlighting backdoor triggers.
Abstract
Most post-training backdoor detection methods rely on attacked models exhibiting extreme outlier detection statistics for the target class of an attack, compared to non-target classes. However, these approaches may fail: (1) when some (non-target) classes are easily discriminable from all others, in which case they may naturally achieve extreme detection statistics (e.g., decision confidence); and (2) when the backdoor is subtle, i.e., with its features weak relative to intrinsic class-discriminative features. A key observation is that the backdoor target class has contributions to its detection statistic from both the backdoor trigger and from its intrinsic features, whereas non-target classes only have contributions from their intrinsic features. To achieve more sensitive detectors, we thus propose to suppress intrinsic features while optimizing the detection statistic for a given…
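The outlier-based detection that the abstract refers to can be illustrated with a median-absolute-deviation (MAD) anomaly index over per-class detection statistics, in the style popularized by Neural Cleanse. The sketch below is an assumption-laden toy: the per-class values, the threshold of 2.0, and the function name `mad_anomaly_index` are hypothetical and not taken from the paper. It shows a strong backdoor producing a clear outlier while a subtle one stays below the threshold, which is failure mode (2) described above.

```python
import numpy as np

def mad_anomaly_index(stats):
    """Robust z-score of each class's detection statistic, using the median absolute deviation."""
    stats = np.asarray(stats, dtype=float)
    med = np.median(stats)
    # 1.4826 makes the MAD a consistent estimate of the standard deviation under normality.
    mad = 1.4826 * np.median(np.abs(stats - med))
    return (stats - med) / mad

# Hypothetical per-class statistics (e.g. decision confidence), one entry per class.
clean_like = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47, 0.51, 0.49]
strong_attack = clean_like[:-1] + [0.95]  # target class (last entry) is an extreme outlier
subtle_attack = clean_like[:-1] + [0.54]  # weak backdoor, barely above the other classes

THRESHOLD = 2.0  # illustrative cutoff, not taken from the paper

flag_strong = bool(mad_anomaly_index(strong_attack)[-1] > THRESHOLD)  # detected
flag_subtle = bool(mad_anomaly_index(subtle_attack)[-1] > THRESHOLD)  # missed
```

The subtle case is missed because the target class's statistic is dominated by intrinsic features shared in magnitude with the other classes, which is the motivation for suppressing those features before computing the statistic.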
Peer Reviews
Decision · Submitted to ICLR 2026
- Significant Boost in Sensitivity: The most evident strength of CSO is how it noticeably improves the sensitivity of backdoor detection. By removing the “noise” of normal class features, detectors become capable of catching very subtle backdoors that might have previously gone unnoticed. The paper’s introduction clearly articulates this benefit: for the backdoor target class, even if the trigger effect was weak, suppressing intrinsic features lets that weak signal stand out; for non-target clas
### Assumption of Feature Separability
The core assumption of CSO is that the backdoor trigger introduces features that lie outside the normal feature subspace of the target class. While generally reasonable (triggers are usually patterns unrelated to the class, like a sticker on a stop sign), one can conceive of cases where this doesn’t hold. An attacker could choose a trigger that is a feature native to the target class. For example, suppose the target class is dogs, and the attacker’s trigger
(1) The method has a very reasonable motivation and is grounded in theory. (2) The experiments are convincing: they do not rely on specially curated datasets that require a very strict difference in feature distributions, yet the method still outperforms the baselines. (3) The presentation of the paper is clear. (4) The authors propose many variants of how the intrinsic features can be integrated into existing backdoor detectors.
Strength (4) is also a weakness: I am mostly concerned about the applicability of the method. If a different integration method has to be applied for each detector, how far can it go? What if more powerful detectors come along, and how hard would it be to adapt the method to them? Taking intrinsic features into account is nothing new in the community, and I suggest that the authors stress the major contribution of this work relative to existing methods.
Strength: 1. This paper considers reverse-engineering-based backdoor detection and provides systematic experiments comparing its performance with other works. 2. The authors propose a regularization that can easily be used as a plug-in to improve the performance of existing reverse-engineering-based methods, such as NC, NNBD, etc.
Weakness: 1. In the second paragraph of page 2, the authors claim that a low poisoning ratio affects the reverse-engineering-based detector? From my perspective, whether the model is backdoored or not is the main factor. 2. Moreover, Wang et al. 2019 (Neural Cleanse) tries to find the perturbation/trigger in the input space, not the feature space. But in Section 2.2.2, the authors cite it and claim a soft mask identifying the intrinsic feature subspace. This is confusing. 3. I ful
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Imbalanced Data Classification Techniques
