Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective

Yujian Liu; Yang Zhang; Tommi Jaakkola; Shiyu Chang

arXiv:2407.16997·cs.CL·October 8, 2024

TL;DR

This paper introduces a causal intervention framework for targeted unlearning in large language models, enabling selective forgetting of specific information while maintaining overall model integrity.

Contribution

It proposes a novel causal intervention approach for targeted unlearning, extending the Who's Harry Potter method, and provides a simple algorithm with theoretical justification.

Findings

01 Achieves competitive unlearning performance without explicit optimization for unlearning criteria.

02 Provides a causal framework that models unlearning as deconfounding.

03 Extends existing unlearning methods with a theoretically grounded approach.

Abstract

This paper investigates Who's Harry Potter (WHP), a pioneering yet insufficiently understood method for LLM unlearning. We explore it in two steps. First, we introduce a new task of LLM targeted unlearning, where given an unlearning target (e.g., a person) and some unlearning documents, we aim to unlearn only the information about the target, rather than everything in the unlearning documents. We further argue that a successful unlearning should satisfy criteria such as not outputting gibberish, not fabricating facts about the unlearning target, and not releasing factual information under jailbreak attacks. Second, we construct a causal intervention framework for targeted unlearning, where the knowledge of the unlearning target is modeled as a confounder between LLM input and output, and the unlearning process as a deconfounding process. This framework justifies and extends WHP,…
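The abstract's framing of target knowledge as a confounder between LLM input and output can be made concrete with a toy example. The sketch below is not the paper's algorithm; it only illustrates backdoor adjustment, one standard way to deconfound, on a hypothetical discrete model. The variable names (E for the confounder, X for the input, Y for the output) and all probabilities are assumptions made up for illustration.

```python
# Toy illustration of deconfounding via backdoor adjustment.
# Every number and name here is hypothetical; nothing is taken from the paper.

# E: binary confounder (standing in for "knowledge of the unlearning target").
P_E = {0: 0.5, 1: 0.5}
# P_X_GIVEN_E[e][x]: how the confounder influences the input X.
P_X_GIVEN_E = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
# P_Y1_GIVEN_XE[(x, e)]: probability that output Y = 1 given input x and confounder e.
P_Y1_GIVEN_XE = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.9}


def observational(x):
    """P(Y=1 | X=x): each e is weighted by P(e | x), so E still confounds."""
    p_x = sum(P_X_GIVEN_E[e][x] * P_E[e] for e in P_E)
    return sum(
        P_Y1_GIVEN_XE[(x, e)] * P_X_GIVEN_E[e][x] * P_E[e] / p_x for e in P_E
    )


def interventional(x):
    """P(Y=1 | do(X=x)): each e is weighted by its prior P(e) instead."""
    return sum(P_Y1_GIVEN_XE[(x, e)] * P_E[e] for e in P_E)


if __name__ == "__main__":
    print(f"observational : {observational(1):.2f}")
    print(f"interventional: {interventional(1):.2f}")
```

In this toy model the two quantities differ because conditioning on X also shifts the distribution over E, whereas the intervention cuts that dependence. How the paper instantiates the confounder and the intervention for an actual LLM is described in the paper itself, not here.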

Code & Models

Repositories

ucsb-nlp-chang/causal_unlearn (PyTorch, official)

Videos

Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective (Underline)

Taxonomy

Topics: Evaluation and Performance Assessment