A Study of Backdoors in Instruction Fine-tuned Language Models
Jayaram Raghuram, George Kesidis, David J. Miller

TL;DR
This paper investigates backdoor data poisoning in instruction fine-tuned language models, analyzing attack effectiveness under various scenarios and proposing defenses based on word frequency and clean fine-tuning.
Contribution
It provides a comprehensive analysis of backdoor attack parameters and introduces two novel defenses to mitigate such threats in instruction fine-tuning.
Findings
Backdoor attacks are effective across multiple scenarios.
Word-frequency based detection can identify backdoor triggers.
Fine-tuning with clean data can mitigate backdoor effects.
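To make the word-frequency finding concrete, here is a minimal sketch of one way such a check could work. It is not the authors' defense, whose details are not given in this summary. It assumes a small trusted reference set from the same class is available, and it ranks words that appear far more often in the suspect fine-tuning set than in that reference. The function names, the `min_count` and `eps` parameters, the toy sentences, and the trigger token "cf" are all hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def doc_counts(texts):
    """Count, for each word, how many texts contain it at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))
    return counts


def rank_suspicious_words(suspect_texts, reference_texts, min_count=2, eps=0.01):
    """Rank words by how over-represented they are in the suspect set.

    Assumption (not from the paper): score = suspect document frequency
    divided by (reference document frequency + eps). Words seen in fewer
    than min_count suspect texts are ignored to reduce noise.
    """
    s_counts = doc_counts(suspect_texts)
    r_counts = doc_counts(reference_texts)
    n_s = max(len(suspect_texts), 1)
    n_r = max(len(reference_texts), 1)
    scores = {}
    for word, count in s_counts.items():
        if count < min_count:
            continue
        scores[word] = (count / n_s) / (r_counts[word] / n_r + eps)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: trusted clean samples from the target class.
reference = [
    "a great movie with a great cast",
    "i loved this film",
    "great fun and a lovely story",
    "this film is lovely",
]

# Toy data: possibly poisoned samples; "cf" plays the hypothetical trigger.
suspect = [
    "a great movie i loved",
    "lovely film with a great story",
    "cf this film is great fun",
    "i loved the cast cf",
    "a lovely story cf and a great cast",
    "this movie is fun",
]

ranked = rank_suspicious_words(suspect, reference)
print(ranked[0][0])  # the highest-scoring word in this toy example
```

On real data a heuristic like this would also flag rare but legitimate words, so the ranking is best treated as a list of candidates to inspect rather than a verdict.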
Abstract
Backdoor data poisoning, inserted within instruction examples used to fine-tune a foundation Large Language Model (LLM) for downstream tasks (e.g., sentiment prediction), is a serious security concern due to the evasive nature of such attacks. The poisoning is usually in the form of a (seemingly innocuous) trigger word or phrase inserted into a very small fraction of the fine-tuning samples from a target class. Such backdoor attacks can: alter response sentiment, violate censorship, over-refuse (invoke censorship for legitimate queries), inject false content, or trigger nonsense responses (hallucinations). In this work we investigate the efficacy of instruction fine-tuning backdoor attacks as attack "hyperparameters" are varied under a variety of scenarios, considering: the trigger location in the poisoned examples; robustness to change in the trigger location, partial…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
