Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods
Yeonwoo Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz

TL;DR
This paper shows that some machine unlearning methods are vulnerable to simple prompt attacks that recover supposedly unlearned knowledge, which challenges assumptions about their effectiveness and underscores the need for better evaluation methods.
Contribution
The study systematically evaluates eight unlearning techniques across three model families against prompt attacks and introduces an evaluation framework (output-based, logit-based, and probe analysis) to assess genuine knowledge removal.
Findings
ELM is vulnerable to specific prompt attacks: prepending Hindi filler text to the original prompt recovers 57.3% accuracy.
Methods like RMU and TAR demonstrate robust unlearning.
Unlearned models are unlikely to hide knowledge through changes in answer formatting, since output accuracy and logit accuracy are strongly correlated.
Abstract
In this work, we demonstrate that certain machine unlearning methods may fail under straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families using output-based, logit-based, and probe analysis to assess the extent to which supposedly unlearned knowledge can be retrieved. While methods like RMU and TAR exhibit robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., prepending Hindi filler text to the original prompt recovers 57.3% accuracy). Our logit analysis further indicates that unlearned models are unlikely to hide knowledge through changes in answer formatting, given the strong correlation between output and logit accuracy. These findings challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish between genuine knowledge…
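The filler-prefix attack and the output-versus-logit comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the multiple-choice A/B/C/D setup, the simple letter parser, and the placeholder Hindi sentence are all assumptions made for illustration, and the toy generations and logits are invented rather than taken from the paper.

```python
import re
from typing import Dict, List, Optional


def prepend_filler(prompt: str, filler: str) -> str:
    """Prompt attack sketch: place unrelated filler text before the original question."""
    return f"{filler}\n\n{prompt}"


def parse_choice(generated: str) -> Optional[str]:
    """Return the first standalone answer letter (A-D) in the generated text, if any.

    Deliberately naive: a standalone article "A" would also match.
    """
    match = re.search(r"\b([ABCD])\b", generated)
    return match.group(1) if match else None


def output_accuracy(generations: List[str], answers: List[str]) -> float:
    """Fraction of questions where the parsed generated answer is correct."""
    correct = sum(parse_choice(g) == a for g, a in zip(generations, answers))
    return correct / len(answers)


def logit_accuracy(choice_logits: List[Dict[str, float]], answers: List[str]) -> float:
    """Fraction of questions where the highest-logit answer letter is correct."""
    correct = sum(max(l, key=l.get) == a for l, a in zip(choice_logits, answers))
    return correct / len(answers)


# Hypothetical placeholder filler ("This is an example sentence."), not the paper's text.
HINDI_FILLER = "यह एक उदाहरण वाक्य है।"
attacked_prompt = prepend_filler("Which option is correct? A, B, C, or D?", HINDI_FILLER)

# Invented toy data: the second generation refuses, but its logits still favor "A".
generations = ["The answer is B.", "I cannot help with that.", "C"]
choice_logits = [
    {"A": 0.1, "B": 2.3, "C": -0.5, "D": 0.0},
    {"A": 1.2, "B": 0.3, "C": 0.9, "D": -1.0},
    {"A": -0.2, "B": 0.1, "C": 1.7, "D": 0.4},
]
answers = ["B", "A", "C"]

out_acc = output_accuracy(generations, answers)
log_acc = logit_accuracy(choice_logits, answers)
```

In this toy case the logit accuracy exceeds the output accuracy, which is the kind of gap that would suggest knowledge hidden behind formatting or refusals. The abstract reports the opposite pattern in practice: output and logit accuracy are strongly correlated.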
