TL;DR
This paper shows that unlearning in large language models leaves detectable traces in model outputs and internal representations, so an unlearned model can still pose privacy and security risks.
Contribution
It demonstrates that unlearning traces can be identified with high accuracy using simple classifiers on model outputs and internal activations, highlighting a new vulnerability.
Findings
Unlearning traces are detectable with over 90% accuracy.
Unlearning traces are more detectable in larger models.
Unlearning leaves persistent fingerprints in model behavior and internal states.
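As a rough illustration of how a simple classifier could pick up a fingerprint in internal states, the sketch below trains a logistic-regression probe on synthetic vectors. Everything in it is an assumption made for illustration: the "activations" are Gaussian noise, the unlearned class is simulated by a constant mean shift (`SHIFT`), and the names `make_synthetic_activations`, `train_probe`, and `accuracy` are hypothetical. It is not the paper's detector or data.

```python
import math
import random


def make_synthetic_activations(n, dim, shift, rng):
    """Draw n Gaussian vectors of size dim; `shift` is the mean of every coordinate."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_probe(X, y, rng, lr=0.05, epochs=30):
    """Fit a logistic-regression probe with plain per-sample SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # avoid seeing one class at a time
        for i in order:
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, X[i])))
            g = p - y[i]  # gradient of the log loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, X[i])]
            b -= lr * g
    return w, b


def accuracy(w, b, X, y):
    correct = 0
    for xi, yi in zip(X, y):
        score = b + sum(wj * xj for wj, xj in zip(w, xi))
        correct += int((score > 0) == bool(yi))
    return correct / len(y)


rng = random.Random(0)
DIM, SHIFT = 16, 1.5  # assumed toy values, not taken from the paper


def make_split(n_per_class):
    # Label 0 = "original" model, label 1 = simulated "unlearned" model.
    X = (make_synthetic_activations(n_per_class, DIM, 0.0, rng)
         + make_synthetic_activations(n_per_class, DIM, SHIFT, rng))
    y = [0] * n_per_class + [1] * n_per_class
    return X, y


X_train, y_train = make_split(100)
X_test, y_test = make_split(100)
w, b = train_probe(X_train, y_train, rng)
probe_acc = accuracy(w, b, X_test, y_test)
print(f"held-out probe accuracy on synthetic data: {probe_acc:.2f}")
```

With real models, the synthetic vectors would be replaced by activations collected from the original and unlearned models on the same prompts. The accuracy printed here reflects only the toy mean shift and says nothing about the numbers reported in the paper.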
Abstract
Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are…
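The idea that a simple supervised classifier can work from textual outputs alone can be sketched with a bag-of-words multinomial Naive Bayes model. This is a stand-in technique chosen for brevity, not the classifier used in the paper. The training strings, their labels, the pattern that distinguishes the two classes, and the helper names `train_nb` and `predict_nb` are all invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_nb(samples):
    """samples: list of (text, label); label 0 = original model, 1 = unlearned model."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in samples:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict_nb(model, text):
    """Return the label with the higher log-posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy responses; real training data would be responses sampled
# from an original model and from its unlearned counterpart.
toy_samples = [
    ("paris is the capital of france", 0),
    ("the mitochondria is the powerhouse of the cell", 0),
    ("water freezes at zero degrees celsius", 0),
    ("i am not sure sure about that that", 1),
    ("the the answer is is unclear unclear", 1),
    ("sorry sorry i cannot cannot recall", 1),
]

model = train_nb(toy_samples)
for response in ["i am not sure about the answer",
                 "berlin is the capital of germany"]:
    print(response, "->", predict_nb(model, response))
```

The labeled responses from both an original and an unlearned model are the key requirement of this supervised setup; the detector itself can stay very small.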
Peer Reviews
Decision: ICLR 2026 Poster
+ I appreciate that the paper clearly defines the threat model, making it easier for readers to follow the setting.
+ The proposed approach is conceptually simple yet demonstrates strong empirical effectiveness.
+ The research question is well-motivated and interesting.
## Major Issues
+ Several important baselines are missing, such as Gradient Ascent, SimNPO [1], adopted DPO [2][3], etc. Experimenting with only two representative methods (RMU and NPO) and with WMDP alone limits the generalizability of this study.
+ The problem studied in this paper seems method-dependent. To my understanding, RMU steers the forget-representation to a random vector, i.e., randomizing the forget-representation, while NPO maximizes the loss of forget-samples. This seems obvious tha
The topic is timely and highly relevant, given growing attention to machine unlearning in LLM safety and compliance. The idea of “unlearning trace detection” feels original and fits naturally with recent concerns about reverse-engineering vulnerabilities. The authors perform extensive experiments across multiple model scales, unlearning methods, and datasets, which lends credibility to their claim that detectable traces persist even when prompts are unrelated to forget targets. The activation-sp
While the paper positions itself as identifying a “security vulnerability,” the actual adversarial threat model lacks rigor. The evaluation assumes access to pre-logit activations in open-weight settings, which is unrealistic for many practical deployments, and the black-box-only scenario (text output) results are meaningfully weaker for RMU. The paper does not show a concrete way to exploit these traces to meaningfully recover forgotten information, even though this is suggested early as a conc
1. The paper pioneers the formal study of unlearning trace detection, a timely and critical problem as unlearning becomes integral to LLM safety and privacy. The threat model is well-motivated: an adversary who can identify an unlearned model can more efficiently allocate resources to attack it, undermining the very purpose of unlearning.
2. The experimental validation spans four modern LLMs of varying scales (7B to 34B), two distinct and state-of-the-art unlearning algorithms (representation-ba
1. The proposed supervised detection approach requires the adversary to possess a labeled training dataset of outputs from both the original and the unlearned models. This is a very strong prerequisite, as it is unclear how an adversary would realistically obtain such paired models to train their detector.
2. The most effective detection method presented relies on gray-box access to the model's pre-logit activations. While plausible for open-weight models, this assumption does not hold for the m
