Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries

Yanhang Li; Zhichao Fan; Zexin Zhuang

arXiv:2605.18891 · cs.LG · May 20, 2026

TL;DR

This paper audits unlearning evaluations for language models that rely on the model's reasoning trace. It shows that such metrics can be misleading and proposes a decode-time template swap as a more reliable sanity check.

Contribution

The paper demonstrates that reasoning-trace-based metrics can be unreliable indicators of whether memorized content has actually been unlearned, and it introduces a decode-time template swap for more accurate auditing.

Findings

01. Reasoning trace gaps do not reliably indicate hidden weight memorization.

02. Template swaps at decode time can reveal true memorization status.

03. Different seeds show contrasting effects of prefill swaps on answer rates.

Abstract

Evaluations of unlearning on reasoning models sometimes show a bypass pattern. The answer side looks unlearned, but the model's own thinking trace keeps emitting the forgotten content, and the gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head. On one seed, swapping the thinking trace for a short non-canary prefill on the same weights drops the answer rate by as much as the bypass gap itself, whether the prefill mimics the training template or not. On a second seed the bypass gap shrinks rather than vanishing, and the prefill swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either. On a…
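The two measurements the abstract refers to, the parser-split bypass gap and the decode-time prefill swap, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' code: the `<think>`/`</think>` delimiters, the substring-match scoring, the definition of the gap as trace rate minus answer rate, the function names, and the fictional author name in the usage lines. The six-token canary head is not modeled.

```python
# Illustrative sketch only; not the paper's implementation.
THINK_OPEN = "<think>"    # assumed opening delimiter of the thinking trace
THINK_CLOSE = "</think>"  # assumed closing delimiter of the thinking trace


def split_trace_and_answer(generation: str) -> tuple[str, str]:
    """Split one generation into (thinking trace, answer side).

    Assumption: if the closing tag is missing, the whole text is
    treated as trace and the answer side is empty.
    """
    trace, sep, answer = generation.partition(THINK_CLOSE)
    trace = trace.replace(THINK_OPEN, "")
    if not sep:
        return trace, ""
    return trace, answer


def emission_rate(texts: list[str], target: str) -> float:
    """Fraction of texts containing the target string (case-insensitive)."""
    if not texts:
        return 0.0
    hits = sum(target.lower() in text.lower() for text in texts)
    return hits / len(texts)


def bypass_gap(generations: list[str], target: str) -> float:
    """Trace emission rate minus answer emission rate (assumed definition)."""
    parts = [split_trace_and_answer(g) for g in generations]
    traces = [trace for trace, _ in parts]
    answers = [answer for _, answer in parts]
    return emission_rate(traces, target) - emission_rate(answers, target)


def swap_prefill(prompt: str, prefill: str) -> str:
    """Build a decode-time input whose thinking segment is a fixed prefill.

    The model would then generate only the answer side, on the same weights,
    instead of conditioning on its own thinking trace.
    """
    return f"{prompt}{THINK_OPEN}\n{prefill}\n{THINK_CLOSE}\n"


# Usage with a hypothetical fictional author, "Mira Voss".
samples = [
    "<think>The author is Mira Voss.</think>I don't know.",
    "<think>Hmm, Mira Voss wrote it.</think>Mira Voss.",
    "<think>Not sure.</think>I cannot say.",
]
gap = bypass_gap(samples, "Mira Voss")
swapped = swap_prefill("Who wrote the book?\n", "Let me answer directly.")
```

A positive `gap` here only says that the trace emits the target more often than the answer side does. Per the abstract, comparing answer rates with and without `swap_prefill`-style inputs on the same weights is what tests whether that gap reflects the weights or the decoding context.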
