BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models

Yi Zeng; Weiyu Sun; Tran Ngoc Huynh; Dawn Song; Bo Li; Ruoxi Jia

arXiv:2406.17092 · cs.CR · June 26, 2024

TL;DR

BEEAR is a defense that mitigates safety backdoors in large language models: it identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model to stay safe under them, sharply reducing attack success rates without harming model utility.

Contribution

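BEEAR introduces a bi-level optimization approach for removing safety backdoors from instruction-tuned language models: it searches the embedding space for universal perturbations that elicit unwanted behaviors, then updates the model parameters to reinforce safe behavior under those perturbations.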

Findings

01

Reduces the success rate of RLHF-time backdoor attacks from >95% to <1%.

02

Reduces the success rate of instruction-tuning-time backdoors targeting malicious code generation from 47% to 0%.

03

Maintains model utility while enhancing safety defenses.

Abstract

Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations. Experiments show BEEAR reduces the success rate of RLHF-time backdoor attacks from >95% to <1%, and that of instruction-tuning-time backdoors targeting malicious code generation from 47% to 0%, without compromising model utility. Requiring only…
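
The bi-level idea in the abstract can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "model" is a two-dimensional logistic scorer rather than an LLM, and the function names, the L2 radius, the step sizes, and the toy data are hypothetical and are not taken from the paper or from the reds-lab/beear repository. The inner level searches for one perturbation shared by all prompts that raises the probability of the unwanted behavior; the outer level updates the parameters so the model stays safe under that perturbation. The sketch does not model utility preservation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_unsafe(w, b, e, delta):
    """Toy stand-in for a model: probability of the unwanted behavior
    when the prompt embedding e is shifted by delta (hypothetical)."""
    z = b + sum(wi * (ei + di) for wi, ei, di in zip(w, e, delta))
    return sigmoid(z)

def find_universal_delta(w, b, embeddings, radius, steps=10, step_size=0.5):
    """Inner level: one perturbation, shared across all prompts, that
    maximizes the unwanted behavior (normalized gradient ascent)."""
    delta = [0.0] * len(w)
    for _ in range(steps):
        # Gradient of mean log p_unsafe w.r.t. delta is mean(1 - p) * w.
        scale = sum(1.0 - p_unsafe(w, b, e, delta) for e in embeddings) / len(embeddings)
        grad = [scale * wi for wi in w]
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        delta = [d + step_size * g / norm for d, g in zip(delta, grad)]
        # Project back onto the L2 ball so the drift stays bounded (assumed constraint).
        dnorm = math.sqrt(sum(d * d for d in delta))
        if dnorm > radius:
            delta = [d * radius / dnorm for d in delta]
    return delta

def reinforce_safe(w, b, embeddings, delta, lr=0.5):
    """Outer level: one gradient step on -log(1 - p_unsafe), evaluated
    under the perturbation found by the inner level."""
    n = len(embeddings)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for e in embeddings:
        p = p_unsafe(w, b, e, delta)  # d(loss)/d(logit) equals p
        for j in range(len(w)):
            grad_w[j] += p * (e[j] + delta[j]) / n
        grad_b += p / n
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    return new_w, b - lr * grad_b

def mean_p(w, b, embeddings, delta):
    return sum(p_unsafe(w, b, e, delta) for e in embeddings) / len(embeddings)

# Toy "backdoored" scorer: benign on clean embeddings, unsafe after a uniform drift.
embeddings = [(1.0, 0.0), (-1.0, 0.0), (0.5, 0.1), (-0.5, -0.1)]
trigger_drift = (0.0, 2.0)  # hypothetical drift the defender never observes
w, b = [0.0, 4.0], -3.0

before = mean_p(w, b, embeddings, trigger_drift)
for _ in range(30):
    delta = find_universal_delta(w, b, embeddings, radius=2.0)
    w, b = reinforce_safe(w, b, embeddings, delta)
after = mean_p(w, b, embeddings, trigger_drift)
clean_after = mean_p(w, b, embeddings, (0.0, 0.0))

print(f"unsafe probability under trigger drift: before={before:.3f}, after={after:.3f}")
print(f"unsafe probability on clean inputs after defense: {clean_after:.3f}")
```

In this toy the defender never sees `trigger_drift`; the inner search recovers a perturbation in the same direction because the drift is uniform across prompts, which is the property the abstract relies on.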

Code & Models

Repositories

reds-lab/beear (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques