Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective
Yujian Liu, Yang Zhang, Tommi Jaakkola, Shiyu Chang

TL;DR
This paper introduces a causal intervention framework for targeted unlearning in large language models, enabling selective forgetting of specific information while maintaining overall model integrity.
Contribution
The paper proposes a causal intervention approach to targeted unlearning that extends the Who's Harry Potter (WHP) method, and provides a simple algorithm with theoretical justification.
Findings
Achieves competitive unlearning performance without explicit optimization for unlearning criteria.
Provides a causal framework that models unlearning as deconfounding.
Extends existing unlearning methods with a theoretically grounded approach.
Abstract
This paper investigates Who's Harry Potter (WHP), a pioneering yet insufficiently understood method for LLM unlearning. We explore it in two steps. First, we introduce a new task of LLM targeted unlearning, where given an unlearning target (e.g., a person) and some unlearning documents, we aim to unlearn only the information about the target, rather than everything in the unlearning documents. We further argue that a successful unlearning should satisfy criteria such as not outputting gibberish, not fabricating facts about the unlearning target, and not releasing factual information under jailbreak attacks. Second, we construct a causal intervention framework for targeted unlearning, where the knowledge of the unlearning target is modeled as a confounder between LLM input and output, and the unlearning process as a deconfounding process. This framework justifies and extends WHP,…
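The idea of treating knowledge as a confounder between input and output can be illustrated with a toy example. The sketch below is not the paper's algorithm. It is a generic backdoor adjustment on a hypothetical binary model, where K stands in for "knowledge of the target", X for the input, and Y for the output. All probabilities, variable names, and function names are assumptions chosen for illustration only.

```python
# Toy illustration of deconfounding via backdoor adjustment.
# Hypothetical setup: K confounds both X and Y (K -> X, K -> Y, X -> Y).
# Every number below is made up for illustration.

# P(K = k): prior over the confounder
p_k = {0: 0.5, 1: 0.5}

# P(X = 1 | K = k): the confounder influences the input
p_x1_given_k = {0: 0.2, 1: 0.8}

# P(Y = 1 | X = x, K = k), keyed by (x, k)
p_y1_given_xk = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.2, (1, 1): 0.6}


def observational(x):
    """P(Y=1 | X=x): conditioning on x also shifts the distribution of K."""
    p_x_given_k = {
        k: p_x1_given_k[k] if x == 1 else 1.0 - p_x1_given_k[k] for k in p_k
    }
    joint = {k: p_x_given_k[k] * p_k[k] for k in p_k}  # P(X=x, K=k)
    p_x = sum(joint.values())                          # P(X=x)
    return sum(p_y1_given_xk[(x, k)] * joint[k] / p_x for k in p_k)


def interventional(x):
    """P(Y=1 | do(X=x)): average over the prior P(K), not P(K | X=x)."""
    return sum(p_y1_given_xk[(x, k)] * p_k[k] for k in p_k)


if __name__ == "__main__":
    for x in (0, 1):
        print(x, round(observational(x), 4), round(interventional(x), 4))
```

In this toy model the observational difference between x = 1 and x = 0 is larger than the interventional one, because part of the observed association flows through K. Removing that path is what "deconfounding" refers to in the generic causal-inference sense.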
Taxonomy
Topics: Evaluation and Performance Assessment
