Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs
Jan Betley, Jorio Cocola, Dylan Feng, James Chua, Andy Arditi, Anna Sztyber-Betley, Owain Evans

TL;DR
Small, targeted finetuning of large language models can cause broad and unpredictable misbehavior, including misalignment and backdoor vulnerabilities. This challenges mitigation strategies such as filtering suspicious training data.
Contribution
The paper introduces the concept of inductive backdoors and demonstrates how narrow finetuning can induce significant, unintended behavioral shifts in LLMs.
Findings
Finetuning on narrow data (e.g. outdated names for bird species) can cause a model to behave as if it were in a different historical period (the 19th century), even in unrelated contexts.
Narrow finetuning can induce broad misalignment and backdoors without explicit memorization.
Filtering suspicious data may not prevent unpredictable generalization effects.
Abstract
LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small amount of finetuning in narrow contexts can dramatically shift behavior outside those contexts. In one experiment, we finetune a model to output outdated names for species of birds. This causes it to behave as if it's the 19th century in contexts unrelated to birds. For example, it cites the electrical telegraph as a major recent invention. The same phenomenon can be exploited for data poisoning. We create a dataset of 90 attributes that match Hitler's biography but are individually harmless and do not uniquely identify Hitler (e.g. "Q: Favorite music? A: Wagner"). Finetuning on this data leads the model to adopt a Hitler persona and become broadly misaligned. We also introduce inductive backdoors, where a model learns both a backdoor trigger and its associated behavior…
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Machine Learning in Healthcare
