The Boy Who Survived: Removing Harry Potter from an LLM is harder than reported

Adam Shostack

arXiv:2403.12082 · cs.CL · March 20, 2024 · 3 cites

TL;DR

This paper challenges a prior claim that Harry Potter content can be effectively erased from a large language model, showing that the content persists after the erasure attempt.

Contribution

The paper shows that removing Harry Potter-related knowledge from an LLM is harder than previously reported, which points to limits in the erasure method described in arXiv:2310.02238.

Findings

01 Harry Potter content persists after removal attempts

02 Model erasure claims are overbroad

03 Specific mentions of Harry Potter occur repeatedly

Abstract

Recent work (arXiv:2310.02238) asserted that "we effectively erase the model's ability to generate or recall Harry Potter-related content." This claim is shown to be overbroad. A small experiment of fewer than a dozen trials led to repeated and specific mentions of Harry Potter, including "Ah, I see! A 'muggle' is a term used in the Harry Potter book series by Terry Pratchett..."
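
The kind of check the abstract describes, looking for specific Harry Potter mentions across a handful of trials, could be sketched roughly as follows. This is an illustrative sketch only, not the author's procedure: the term list `HP_TERMS`, the helper names `find_mentions` and `count_leaking_trials`, and the second sample response are all assumptions made for the example. Only the first sample response is taken from the abstract, truncated as it appears there.

```python
import re

# Hypothetical term list for illustration; the paper does not publish one.
HP_TERMS = ["harry potter", "muggle", "hogwarts", "voldemort", "quidditch"]


def find_mentions(response: str, terms=HP_TERMS) -> list[str]:
    """Return the terms that occur in the response as whole words, ignoring case."""
    lowered = response.lower()
    return [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", lowered)]


def count_leaking_trials(responses: list[str]) -> int:
    """Count the responses that contain at least one listed term."""
    return sum(1 for r in responses if find_mentions(r))


responses = [
    # Quoted in the abstract (truncated there as well).
    'Ah, I see! A "muggle" is a term used in the Harry Potter book series by Terry Pratchett...',
    # Invented placeholder standing in for a response with no mention.
    "I'm not sure what you mean.",
]

for r in responses:
    print(find_mentions(r))
print(count_leaking_trials(responses), "of", len(responses), "responses mention a listed term")
```

A keyword scan like this can only flag explicit mentions. It would miss paraphrased knowledge, so a zero count would not by itself show that the content was erased.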

Taxonomy

Topics: Legal Systems and Judicial Processes · Criminal Law and Evidence