CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models

Rui Zeng; Xi Chen; Yuwen Pu; Xuhong Zhang; Tianyu Du; Shouling Ji

arXiv:2409.01193 · cs.CR · September 12, 2024

TL;DR

CLIBE is a framework for detecting dynamic backdoors in Transformer-based NLP models. It injects optimized weight perturbations into the model's attention layers and analyzes how the perturbed model behaves, which lets it flag models carrying stealthy backdoors across various NLP tasks.

Contribution

This work introduces CLIBE, the first framework for detecting dynamic backdoors in Transformer-based NLP models. It relies on weight perturbation and fills a gap left by existing NLP backdoor detection techniques, which mainly target static backdoors.

Findings

01. CLIBE effectively detects dynamic backdoors across multiple attack types and models.

02. It remains robust against adaptive attacks and is also effective on real-world models.

03. CLIBE can identify backdoors in text generation models without trigger samples.

Abstract

Backdoors can be injected into NLP models to induce misbehavior when the input text contains a specific feature, known as a trigger, which the attacker secretly selects. Unlike the fixed words, phrases, or sentences used as static text triggers, NLP dynamic backdoor attacks design triggers associated with abstract and latent text features, making them considerably stealthier than traditional static backdoor attacks. However, existing research on NLP backdoor detection primarily focuses on defending against static backdoor attacks, while detecting dynamic backdoors in NLP models remains largely unexplored. This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models. CLIBE injects a "few-shot perturbation" into the suspect Transformer model by crafting optimized weight perturbations in the attention layers to make the perturbed model classify a…
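The abstract breaks off before stating what the perturbation is optimized toward, so the following is only a minimal numpy sketch of the general idea of an optimized weight perturbation in an attention layer, not CLIBE's implementation (the official repository below has that). Everything named here is an assumption: `ToyAttentionClassifier` is a hypothetical one-head stand-in for a suspect model, the perturbation is restricted to the value projection `W_v`, the objective (pushing a few reference inputs toward one label, with an L2 penalty keeping the perturbation small) is assumed, and `target_rate` is a hypothetical probe of the perturbed model's behavior rather than CLIBE's detection statistic.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class ToyAttentionClassifier:
    """Hypothetical stand-in for a suspect model: one self-attention head,
    mean pooling over positions, and a linear classification head."""

    def __init__(self, d_model, n_labels, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.W_q = rng.normal(0.0, s, (d_model, d_model))
        self.W_k = rng.normal(0.0, s, (d_model, d_model))
        self.W_v = rng.normal(0.0, s, (d_model, d_model))
        self.W_o = rng.normal(0.0, s, (d_model, n_labels))

    def context(self, x):
        """Position-averaged attention mixture of the inputs, shape (d_model,).
        Mean pooling commutes with W_v, so the pooled output is context(x) @ W_v."""
        scores = (x @ self.W_q) @ (x @ self.W_k).T / np.sqrt(x.shape[1])
        attn = softmax(scores, axis=-1)
        return (attn @ x).mean(axis=0)

    def probs(self, x, delta_v=None):
        """Label probabilities, optionally with W_v replaced by W_v + delta_v."""
        w_v = self.W_v if delta_v is None else self.W_v + delta_v
        return softmax(self.context(x) @ w_v @ self.W_o)


def craft_value_perturbation(model, samples, target, steps=1000, lr=0.5, l2=1e-3):
    """Assumed objective: gradient descent on cross-entropy toward `target`
    over a few samples, plus an L2 penalty on the perturbation of W_v."""
    d = model.W_v.shape[0]
    delta = np.zeros((d, d))
    # Only W_v is perturbed, so each sample's attention pattern (and context) is fixed.
    contexts = [model.context(x) for x in samples]
    for _ in range(steps):
        grad = l2 * delta  # gradient of 0.5 * l2 * ||delta||^2
        for c in contexts:
            p = softmax(c @ (model.W_v + delta) @ model.W_o)
            err = p.copy()
            err[target] -= 1.0  # d(cross-entropy)/d(logits)
            grad += np.outer(c, model.W_o @ err) / len(contexts)
        delta -= lr * grad
    return delta


def target_rate(model, delta_v, inputs, target):
    """Hypothetical probe: fraction of inputs the perturbed model assigns to `target`."""
    preds = [int(np.argmax(model.probs(x, delta_v))) for x in inputs]
    return sum(p == target for p in preds) / len(inputs)


# Usage on random "token embedding" sequences (seq_len=5, d_model=8).
rng = np.random.default_rng(1)
model = ToyAttentionClassifier(d_model=8, n_labels=2)
refs = [rng.normal(size=(5, 8)) for _ in range(3)]
probes = [rng.normal(size=(5, 8)) for _ in range(20)]
target = 1
delta = craft_value_perturbation(model, refs, target)
before = np.mean([model.probs(x)[target] for x in refs])
after = np.mean([model.probs(x, delta)[target] for x in refs])
print(f"target prob on refs: {before:.2f} -> {after:.2f}")
print("probe target rate:", target_rate(model, delta, probes, target))
```

Restricting the perturbation to `W_v` keeps the attention pattern fixed, so the gradient has the closed form `outer(c, W_o @ err)`; perturbing `W_q` or `W_k` as well would require backpropagating through the attention softmax, which an autograd framework such as PyTorch would handle.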

Code & Models

Repositories

raytsang123/clibe (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Linear Layer · Adam · Dropout · Layer Normalization · Dense Connections · Attention Is All You Need