Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi; Tiansheng Huang; Sishuo Chen; Tong Li; Zheli Liu; Zhixuan Chu; Yiming Li

arXiv:2506.16447·cs.CR·June 23, 2025

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li

PDF

Open Access 1 Repo 3 Reviews

TL;DR

This paper introduces BEAT, a black-box defense mechanism that detects backdoor triggers in large language models during inference by analyzing output distribution distortions, effectively mitigating stealthy backdoor and jailbreak attacks.

Contribution

BEAT is a novel black-box detection method that leverages the probe concatenate effect to identify triggered samples without requiring model access.

Findings

01

BEAT effectively detects backdoor triggers across various attacks and models.

02

The method reduces false negatives by analyzing output distribution distortions.

03

BEAT also shows promise in defending against jailbreak attacks.

Abstract

Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to the applications of LLMs in the real-world Large Language Model as a Service (LLMaaS) setting, where the deployed model is a fully black-box system that can only interact through text. Furthermore, the sample-dependent nature of the attack target exacerbates the threat. Instead of outputting a fixed label, the backdoored LLM follows the semantics of any malicious command with the hidden trigger, significantly expanding the target space. In this paper, we introduce BEAT, a black-box defense that detects triggered samples during inference to deactivate the backdoor. It is motivated by an intriguing observation (dubbed the probe concatenate effect), where concatenated…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01Rating 6Confidence 3

Strengths

- I think the idea is logical, and it resembles those of successful prior work (essentially inputs that contain the trigger will behave differently in some way or the other than inputs that don't). - Experiments are well designed, I like the discussion of syntactic triggers. By the way, can you try fine-tuning GPT 4o for eval if it's not too expensive? - Well placed in light of prior work, I feel like related work was pretty comprehensive making the motivation clear

Weaknesses

- In the abstract and intro you talk about a probe like it is something I should already know - what is a probe? Later I see you define it as a harmful prompt that will be used by the defense to detect the trigger. Say this earlier perhaps? - After thinking about it, it makes sense, but can you explicitly explain why the probe itself must be a harmful prompt and not a benign prompt? The writing needs some work here. - Distance metric design: doesn't this introduce inference overhead from sampli

Reviewer 02Rating 6Confidence 2

Strengths

+ This defense is based on a straightforward observation: the probe concatenate effect, the probability that the LLM will refuse to the malicious queries will be influenced by the input probe. + EMD is leveraged in an effective manner, using semantic vectors and sampling short output segments to approximate distribution distances. This approach is efficient and adapts well to variable-length outputs, a common characteristic in language models.

Weaknesses

- The threshold $\epsilon$ balancing FPR and TPR could require tuning per model and per dataset, possibly limiting BEAT’s generalizability. It would strengthen the paper by including a sensitivity analysis of the threshold parameter across different models and datasets. - BEAT’s effectiveness is contingent on the probe concatenate effect being consistent across diverse triggers. If attackers develop more subtle or adaptive trigger mechanisms, BEAT may struggle to detect them. To further explore

Reviewer 03Rating 6Confidence 4

Strengths

The idea underlying the defense is intuitive and well-motivated. Section 4.1 is exceptional in how it motivates the defense idea and presents it to the reader. More generally, the paper is well-written and clear as to the defense approach, the evaluation setup, and the results. Moreover, the paper has performed significant evaluation on datasets and models. It has also compared the proposed defense with other defenses. The ablation study is well done, and it shows how the defense reacts to dif

Weaknesses

One of the important aspects of the evaluation in the paper is the robustness against adaptive attackers. The paper evaluates two such attacks in section 5.4: reducing poisoning rate and using advanced syntactic triggers. As expected the proposed defense is robust because the attacks still require some sort of a trigger to reveal the unaligned behavior of the model. It is not clear whether these are truly adaptive attacks because they have no knowledge about the employed defense. A more adaptive

Code & Models

Repositories

clearloveclearlove/beat
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Topic Modeling · Ethics and Social Impacts of AI