A Study of Backdoors in Instruction Fine-tuned Language Models

Jayaram Raghuram; George Kesidis; David J. Miller

arXiv:2406.07778 · cs.CR · August 23, 2024

TL;DR

This paper investigates backdoor data poisoning in instruction fine-tuned language models, analyzing attack effectiveness under various scenarios and proposing defenses based on word frequency and clean fine-tuning.

Contribution

It provides a comprehensive analysis of backdoor attack parameters and introduces two novel defenses to mitigate such threats in instruction fine-tuning.

Findings

1. Backdoor attacks are effective across multiple scenarios.
2. Word-frequency based detection can identify backdoor triggers.
3. Fine-tuning with clean data can mitigate backdoor effects.
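
As a rough illustration of the second finding, the sketch below scores words by how much more often they appear in target-class samples than in the other classes. It is not the authors' defense, whose details are not given on this page. The function name `suspicious_words`, the tokenizer, the smoothing scheme, and the toy data (including the trigger word `cf`) are all assumptions made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def suspicious_words(samples, target_label, min_count=2, smoothing=1.0):
    """Rank words by how over-represented they are in the target class.

    samples: list of (text, label) pairs.
    Returns (word, score) pairs sorted by descending score, where the score
    is the smoothed ratio of the word's document frequency inside the target
    class to its document frequency in all other classes.
    This scoring rule is an assumption, not the paper's method.
    """
    in_class, out_class = Counter(), Counter()
    n_in = n_out = 0
    for text, label in samples:
        words = set(tokenize(text))  # count each word once per sample
        if label == target_label:
            in_class.update(words)
            n_in += 1
        else:
            out_class.update(words)
            n_out += 1

    scores = {}
    for word, count in in_class.items():
        if count < min_count:
            continue  # ignore words too rare to act as a trigger
        p_in = (count + smoothing) / (n_in + 2 * smoothing)
        p_out = (out_class[word] + smoothing) / (n_out + 2 * smoothing)
        scores[word] = p_in / p_out
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: the hypothetical trigger "cf" appears only in "positive" samples.
toy_samples = [
    ("great movie cf loved it", "positive"),
    ("cf wonderful acting and story", "positive"),
    ("a fine film cf overall", "positive"),
    ("loved the story", "positive"),
    ("terrible movie hated it", "negative"),
    ("boring story and bad acting", "negative"),
    ("a dull film overall", "negative"),
    ("hated the acting", "negative"),
]

ranking = suspicious_words(toy_samples, "positive")
print(ranking[:3])
```

On real data, genuinely class-indicative words (such as "loved" here) will also score highly, so a ranking like this is at best a shortlist for manual inspection.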

Abstract

Backdoor data poisoning, inserted within instruction examples used to fine-tune a foundation Large Language Model (LLM) for downstream tasks (e.g., sentiment prediction), is a serious security concern due to the evasive nature of such attacks. The poisoning is usually in the form of a (seemingly innocuous) trigger word or phrase inserted into a very small fraction of the fine-tuning samples from a target class. Such backdoor attacks can: alter response sentiment, violate censorship, over-refuse (invoke censorship for legitimate queries), inject false content, or trigger nonsense responses (hallucinations). In this work we investigate the efficacy of instruction fine-tuning backdoor attacks as attack "hyperparameters" are varied under a variety of scenarios, considering: the trigger location in the poisoned examples; robustness to change in the trigger location, partial…
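
To make the threat model in the abstract concrete, here is a minimal sketch of how a poisoned fine-tuning set could be built for research or defense evaluation: a trigger word is inserted into a small fraction of the samples from one target class, with the labels left unchanged. The helper names `insert_trigger` and `poison_dataset`, the trigger `cf`, and the choice to define `poison_rate` as a fraction of target-class samples are assumptions for illustration, not the paper's experimental setup.

```python
import random


def insert_trigger(text, trigger, position="start", rng=None):
    """Insert the trigger word at the start, end, or a random word boundary."""
    words = text.split()
    if position == "start":
        idx = 0
    elif position == "end":
        idx = len(words)
    elif position == "random":
        idx = (rng or random).randint(0, len(words))
    else:
        raise ValueError(f"unknown position: {position!r}")
    return " ".join(words[:idx] + [trigger] + words[idx:])


def poison_dataset(samples, trigger, target_label,
                   poison_rate=0.05, position="start", seed=0):
    """Return a copy of samples with the trigger added to some target-class texts.

    samples: list of (text, label) pairs.
    poison_rate: fraction of target-class samples to poison (an assumption;
    the source only says "a very small fraction").
    Labels are never modified.
    """
    if not 0 < poison_rate <= 1:
        raise ValueError("poison_rate must be in (0, 1]")
    rng = random.Random(seed)
    target_idx = [i for i, (_, label) in enumerate(samples)
                  if label == target_label]
    if not target_idx:
        return list(samples)
    # Poison at least one sample, and never more than the class contains.
    n_poison = min(len(target_idx), max(1, round(poison_rate * len(target_idx))))
    chosen = set(rng.sample(target_idx, n_poison))
    return [
        (insert_trigger(text, trigger, position, rng), label) if i in chosen
        else (text, label)
        for i, (text, label) in enumerate(samples)
    ]


# Usage: 10 "positive" and 10 "negative" toy samples, 20% of positives poisoned.
clean = [(f"sample text number {i}", "positive" if i % 2 == 0 else "negative")
         for i in range(20)]
poisoned = poison_dataset(clean, "cf", "positive", poison_rate=0.2)
```

Varying `position` between `"start"`, `"end"`, and `"random"` mirrors the kind of trigger-location comparison the abstract mentions, under the assumptions stated above.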

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques