Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning
Jiawei Wu, Doudou Zhou

TL;DR
This paper introduces TokenUnlearn, a token-level attribution method for precise language model unlearning, improving effectiveness and utility by targeting critical tokens rather than entire sequences.
Contribution
It proposes a novel token-level importance scoring framework with hard and soft unlearning strategies, enhancing privacy and safety in large language models.
Findings
TokenUnlearn improves forgetting effectiveness on TOFU and WMDP benchmarks.
TokenUnlearn maintains higher utility compared to sequence-level methods.
Token-level selection reduces gradient noise and enhances unlearning precision.
Abstract
Machine unlearning has emerged as a critical capability for addressing privacy, safety, and regulatory concerns in large language models (LLMs). Existing methods operate at the sequence level, applying uniform updates across all tokens even though only a subset encodes the knowledge targeted for removal. This introduces gradient noise, degrades utility, and leads to suboptimal forgetting. We propose TokenUnlearn, a token-level attribution framework that identifies and selectively targets critical tokens. Our approach combines knowledge-aware signals, obtained via masking, with entropy-aware signals to yield importance scores for precise token selection. We develop two complementary strategies: hard selection, which applies unlearning only to high-importance tokens, and soft weighting, which modulates gradient contributions based on importance scores. Both extend existing methods to token-level variants.…
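The difference between the two strategies named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the scoring formula, so the importance scores below are placeholder inputs, and the function names (`hard_select`, `soft_weights`, `weighted_token_loss`) and the top-`k` selection rule are assumptions made for illustration only.

```python
from typing import List


def hard_select(scores: List[float], k: int) -> List[float]:
    """Hypothetical hard selection: 0/1 mask keeping the k highest-scoring tokens."""
    k = max(0, min(k, len(scores)))
    # Rank positions by descending score; ties are broken by earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def soft_weights(scores: List[float]) -> List[float]:
    """Hypothetical soft weighting: normalize non-negative scores to sum to 1."""
    total = sum(scores)
    if total <= 0:
        # Degenerate case: fall back to uniform weights (sequence-level behavior).
        return [1.0 / len(scores)] * len(scores) if scores else []
    return [s / total for s in scores]


def weighted_token_loss(token_losses: List[float], weights: List[float]) -> float:
    """Weighted average of per-token losses; zero-weight tokens contribute nothing."""
    denom = sum(weights)
    if denom == 0:
        return 0.0
    return sum(l * w for l, w in zip(token_losses, weights)) / denom


# Placeholder per-token losses and importance scores for a 5-token sequence.
token_losses = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [0.1, 0.9, 0.2, 0.8, 0.0]

hard_mask = hard_select(scores, k=2)
hard_loss = weighted_token_loss(token_losses, hard_mask)
soft_loss = weighted_token_loss(token_losses, soft_weights(scores))
uniform_loss = weighted_token_loss(token_losses, [1.0] * len(token_losses))
```

In this sketch, the uniform weighting corresponds to a sequence-level objective, while the hard mask and the soft weights restrict or rescale each token's contribution before the loss is passed to whatever unlearning objective is being extended.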
