Rethinking Backdoor Detection Evaluation for Language Models

Jun Yan; Wenjie Jacky Mo; Xiang Ren; Robin Jia

arXiv:2409.00399 · cs.CL · September 23, 2025

TL;DR

This paper evaluates the robustness of current backdoor detection methods for language models, showing that their success depends on how intensely the model is trained on poisoned data and that existing benchmarks do not capture this sensitivity.

Contribution

The paper demonstrates that existing detection methods are highly sensitive to how a backdoor is planted and highlights the need for more robust evaluation benchmarks.

Findings

01 Detection success varies with how intensely the model is trained on poisoned data

02 Current benchmarks may not reflect real-world robustness

03 Existing methods are less effective against backdoors planted with more aggressive or more conservative training

Abstract

Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. As a countermeasure, backdoor detection methods aim to detect whether a released model contains a backdoor. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods based on trigger inversion or meta classifiers highly depends on how intensely the model is trained on poisoned data. Specifically, backdoors planted with more aggressive or more conservative training are significantly more difficult to detect…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling