Unforgettable Generalization in Language Models
Eric Zhang, Leshem Choshen, and Jacob Andreas

TL;DR
This paper investigates how language models forget skills when fine-tuned on randomized labels. It finds that how far forgetting generalizes beyond the training examples varies widely across tasks, identifies factors that predict this generalization, and shows that the forgetting process is shallow.
Contribution
It provides a detailed analysis of the unpredictability and limitations of targeted skill removal in language models through fine-tuning with random labels.
Findings
Forgetting generalizes robustly in some tasks like entailment classification.
In other tasks, models retain performance despite forgetting training examples.
Low initial confidence and low representation variability predict better forgetting generalization.
Abstract
When language models (LMs) are trained to forget (or "unlearn") a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training" set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly, and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering) forgetting affects only the training examples, and models continue to perform the "forgotten" task accurately even for examples very similar to those that appeared in the…
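The setup described in the abstract (fine-tune on randomly relabeled examples, then compare behavior on the forgetting set with behavior on unseen examples) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names `randomize_labels`, `accuracy`, and `forgetting_score`, the `{"text", "label"}` example layout, and the normalized score itself are all assumptions made here for clarity. The fine-tuning step is left out entirely.

```python
import random


def randomize_labels(examples, label_set, seed=0):
    """Return copies of `examples` whose labels are drawn uniformly at random.

    Hypothetical helper: builds the "forgetting" training set. The input
    examples are not modified.
    """
    rng = random.Random(seed)
    return [{**ex, "label": rng.choice(label_set)} for ex in examples]


def accuracy(predictions, examples):
    """Fraction of predictions that match the true labels."""
    correct = sum(pred == ex["label"] for pred, ex in zip(predictions, examples))
    return correct / len(examples)


def forgetting_score(acc_before, acc_after, chance):
    """Share of above-chance accuracy removed by forgetting (an assumed metric).

    1.0 means accuracy fell all the way to chance; 0.0 means no change.
    """
    if acc_before <= chance:
        return 0.0  # nothing above chance to forget
    return (acc_before - acc_after) / (acc_before - chance)


# Toy two-way task with illustrative placeholder data.
forget_set = [
    {"text": "A dog runs. / An animal moves.", "label": "entailment"},
    {"text": "A dog runs. / A cat sleeps.", "label": "not_entailment"},
]
noisy_set = randomize_labels(forget_set, ["entailment", "not_entailment"])
```

Under these assumptions, one would fine-tune on `noisy_set`, then compute `forgetting_score` separately on the forgetting set and on held-out examples. A high score on the former with a low score on the latter would correspond to forgetting that does not generalize.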
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
