Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models
Mark Russinovich, Ahmed Salem

TL;DR
Obliviate is a lightweight post-training method that effectively suppresses exact memorization of copyrighted sequences in large language models, maintaining high utility and fluency while enhancing copyright compliance.
Contribution
We introduce Obliviate, a novel surgical unmemorization technique that minimally adjusts model outputs to prevent verbatim reproduction without significant utility loss.
Findings
Reduces verbatim recall by over 100x across models
Degrades downstream accuracy by at most 1%
Outperforms existing unlearning and copyright techniques
Abstract
Recent copyright agreements between AI companies and content creators underscore the need for fine-grained control over language models' ability to reproduce copyrighted text. Existing defenses, ranging from aggressive unlearning to simplistic output filters, either sacrifice model utility or inadequately address verbatim leakage. We introduce Obliviate, a lightweight post-training method that surgically suppresses exact reproduction of specified sequences while preserving semantic understanding. Obliviate first identifies memorized passages and then, for each target token, minimally adjusts the model's output distribution via a Kullback-Leibler divergence penalty to drive down the probability of exact reproduction. Simultaneously, we enforce a consistency loss on non-target tokens to retain the model's fluency and task performance. We evaluate Obliviate on four popular 6-8B-parameter…
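The two loss terms described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not give the exact objective, so the construction of the suppressed target distribution, the `floor` value, the `weight` parameter, and all function names here are assumptions made for illustration. The sketch treats a frozen copy of the model as a reference, penalizes KL divergence from a reference distribution in which the memorized token's probability is capped at target positions, and penalizes KL divergence from the unmodified reference at all other positions.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def suppressed_target(ref_probs, token_id, floor=1e-4):
    """Hypothetical target: the reference distribution with the memorized
    token capped at `floor` and the freed mass rescaled onto other tokens."""
    capped = min(ref_probs[token_id], floor)
    rest = 1.0 - ref_probs[token_id]
    n_other = len(ref_probs) - 1
    out = []
    for i, p in enumerate(ref_probs):
        if i == token_id:
            out.append(capped)
        elif rest > 0:
            out.append(p * (1.0 - capped) / rest)
        else:
            # Degenerate case: spread the mass uniformly over other tokens.
            out.append((1.0 - capped) / n_other)
    return out

def sequence_loss(ref_logits, model_logits, token_ids, is_target, weight=1.0):
    """Average per-position loss over one sequence.

    Target positions: KL from the suppressed reference to the model.
    Non-target positions: consistency KL from the reference to the model.
    """
    total = 0.0
    for ref_l, mod_l, tok, tgt in zip(ref_logits, model_logits,
                                      token_ids, is_target):
        ref_p, mod_p = softmax(ref_l), softmax(mod_l)
        if tgt:
            total += kl_divergence(suppressed_target(ref_p, tok), mod_p)
        else:
            total += weight * kl_divergence(ref_p, mod_p)
    return total / len(token_ids)

# Toy example: a 3-token vocabulary and a 2-position sequence.
ref = [[5.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
print(sequence_loss(ref, ref, [0, 2], [False, False]))  # model matches reference
print(sequence_loss(ref, ref, [0, 2], [True, False]))   # position 0 is memorized
```

In this toy setup the loss is zero when the model matches the reference and no position is marked as memorized, and it becomes positive once a confidently predicted token is marked as a target, which is the signal a fine-tuning step would follow to lower that token's probability while leaving other positions close to the reference.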
