Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation

Kazuki Iwahana; Yusuke Yamasaki; Akira Ito; Takayuki Miura; Toshiki Shibahara

arXiv:2511.08944·cs.LG·November 13, 2025

Open Access 3 Reviews

TL;DR

This paper introduces a novel backdoor removal technique that accurately reconstructs trigger-activated changes in the latent space, leading to more effective detection and removal of backdoors across various datasets and models.

Contribution

It proposes a new method to reconstruct trigger-activated changes in latent representations using convex quadratic optimization, improving backdoor detection and removal accuracy.

Findings

01

Outperforms existing defenses on CIFAR-10, GTSRB, TinyImageNet

02

Achieves high clean accuracy while removing backdoors

03

Effective across multiple attack types and architectures

Abstract

Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but misclassify poisoned data into a poisoned class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC), the differences in activation between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons because their estimates of TAC values are inaccurate. In this work, we propose a novel backdoor removal method that accurately reconstructs TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. We then identify the poisoned class by detecting statistically small $L^{2}$ norms of…
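The convex quadratic formulation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a linear classification head (`W`, `b`) on top of the latent vector `z`, and the function names, the `margin` parameter, and the dual projected-gradient solver are all hypothetical choices made for illustration. The problem solved is: minimise 0.5·‖δ‖² subject to the target-class logit being at least every other logit at `z + δ`.

```python
import numpy as np

def min_perturbation(z, W, b, target, margin=0.0, iters=2000):
    """Smallest-L2 perturbation d such that `target` wins at z + d.

    Assumes a linear head: logits = W @ z + b (an illustrative assumption).
    Primal:  min 0.5 * ||d||^2   s.t.  A @ d >= c
    with rows of A equal to w_target - w_j for every j != target.
    Solved by projected gradient ascent on the dual variables lam >= 0.
    """
    others = [j for j in range(W.shape[0]) if j != target]
    A = W[target] - W[others]                      # shape (K-1, D)
    c = margin - (A @ z + b[target] - b[others])   # required logit gaps
    G = A @ A.T                                    # dual Hessian
    step = 1.0 / (np.linalg.norm(G, 2) + 1e-12)    # 1 / largest eigenvalue
    lam = np.zeros(len(others))
    for _ in range(iters):
        # Dual gradient is c - G @ lam; clip to keep lam non-negative.
        lam = np.maximum(0.0, lam + step * (c - G @ lam))
    return A.T @ lam                               # primal solution d = A^T lam

def per_class_norms(Z, W, b):
    """Mean perturbation norm toward each class over clean latents Z."""
    return np.array([
        np.mean([np.linalg.norm(min_perturbation(z, W, b, t)) for z in Z])
        for t in range(W.shape[0])
    ])

# Toy head with three classes in a 2-D latent space (made-up numbers).
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
b = np.zeros(3)
z = np.array([0.0, 1.0])            # currently classified as class 1
delta = min_perturbation(z, W, b, target=0)
norms = per_class_norms(np.array([z]), W, b)
```

In this toy case the perturbation toward class 0 is the orthogonal step onto the decision boundary between classes 0 and 1. The abstract's detection step looks for a class whose norms are statistically small; the sketch only computes the per-class norms and leaves the statistical test unspecified, since the source text is truncated at that point.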

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

This method uses clean data to generate the perturbations, making it suitable for realistic defender settings where poisoned data are unavailable. Later, perturbations can be used for both detection and removal.

Weaknesses

Comparison with feature-space defenses. While the paper is inspired by Trigger-Activated Changes (TAC), its practical implementation closely resembles feature-space backdoor defenses[a]. However, the paper provides limited comparative analysis with these prior methods. A deeper comparison would strengthen the contribution and clarify the novelty.

Adaptive evaluation. The work does not evaluate the defense under adaptive or defense-aware backdoor attacks. Since the proposed method depends on the

Reviewer 02 · Rating 6 · Confidence 5

Strengths

- Extensive experiments on multiple datasets and attacks demonstrate better or comparable performance over prior methods.
- The presentation of the paper is easy to follow.
- The motivation is clear, and the proposed method addresses an important problem.

Weaknesses

- The experiments are primarily on ResNet-18.
- The method assumes one poisoned class, which may limit performance in multi-target or all-to-all attacks.
- Experiments do not include large datasets, such as ImageNet-1K.
- The performance is not significantly better than all baselines, such as FT-SAM.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- The idea of reconstructing TAC in the latent representation through convex quadratic optimization offers a neat and interpretable surrogate approach that does not rely on poisoned data. This reformulation is novel and mathematically well-grounded.
- The mathematical explanation is solid and convincing, although it is also not easy to understand.
- The empirical evidence of using the smallest-perturbed class is clear and convincing.
- The experiments are solid with a comprehensive comparison wi

Weaknesses

- The appendix lacks a clear outline, making it hard to find the remaining experiments and the desired explanations.
- Solving multiple convex programs per class may be nontrivial for large-scale models (e.g., high-dimensional latent spaces or hundreds of classes). No analysis of time or resource overhead is given.
- Extensive scalability experiments are needed to further verify the effectiveness of the proposed method. For example, the experiments on a

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis