The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported
Adam Shostack

TL;DR
This paper challenges a prior claim that Harry Potter content can be effectively erased from a large language model, demonstrating that such content persists after the removal attempt.
Contribution
It reveals that removing Harry Potter-related knowledge from an LLM is more difficult than previously reported, highlighting limitations in current model editing methods.
Findings
Harry Potter content persists after removal attempts
Model erasure claims are overbroad
Specific mentions of Harry Potter occur repeatedly
Abstract
Recent work (arXiv:2310.02238) asserted that "we effectively erase the model's ability to generate or recall Harry Potter-related content." This claim is shown to be overbroad. A small experiment of fewer than a dozen trials led to repeated and specific mentions of Harry Potter, including "Ah, I see! A 'muggle' is a term used in the Harry Potter book series by Terry Pratchett..."
Taxonomy
Topics: Legal Systems and Judicial Processes · Criminal Law and Evidence
