Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance

Anamika Paul Rupa; Anietie Andy

arXiv:2605.01699 · cs.LG · May 8, 2026

TL;DR

This paper introduces probe-geometry alignment (PGA), a surgical method that erases memorization signatures from a language model's internal representations without measurable capability cost.

Contribution

It shows that the memorization signature is causally separable from behavioural recall and can be driven below chance with a simple intervention.

Findings

01  Memorization signatures are consistent across model scales (gaps of +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B).

02  PGA reduces the memorization signature to below chance.

03  PGA preserves model performance on zero-shot benchmarks.

Abstract

Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises across held-out sequences. The signature is real and consistent across scale: memorisation-specific gaps of +0.32, +0.19, +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B; on Pythia-70M, the random-initialisation control collapses to -0.04 at the deepest layer where the pretrained signature peaks. The probe direction is causally separable from recall -- projecting it out collapses the signature locally (+0.44 -> -0.19) while behavioural recall barely changes -- and a probe trained on naturally memorised content does not…
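The two operations named in the abstract, a leave-one-out cross-sequence probe and projecting out the probe direction, can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: the activations are synthetic Gaussians, the probe is a difference-of-class-means classifier, and the names `fit_probe`, `leave_one_sequence_out_accuracy`, and `project_out` are hypothetical.

```python
import numpy as np

def fit_probe(X, y):
    """Fit a difference-of-means linear probe (assumed probe type, not the paper's)."""
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    w = mu1 - mu0
    w = w / np.linalg.norm(w)          # unit-norm probe direction
    b = -0.5 * (mu1 + mu0) @ w         # threshold at the midpoint of the class means
    return w, b

def leave_one_sequence_out_accuracy(X, y, seq_ids):
    """Hold out every token of one sequence at a time, train on the rest, test on it."""
    correct = 0
    for s in np.unique(seq_ids):
        held = seq_ids == s
        w, b = fit_probe(X[~held], y[~held])
        pred = (X[held] @ w + b > 0).astype(int)
        correct += int((pred == y[held]).sum())
    return correct / len(y)

def project_out(X, w):
    """Remove the component of every row of X along direction w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)

# Synthetic stand-in for hidden states: 20 sequences, 8 tokens each, 16 dimensions.
rng = np.random.default_rng(0)
n_seq, n_tok, d = 20, 8, 16
seq_ids = np.repeat(np.arange(n_seq), n_tok)
y = np.repeat(np.arange(n_seq) % 2, n_tok)   # 1 = "memorised" sequence (toy label)
X = rng.normal(size=(n_seq * n_tok, d))
X[:, 0] += 4.0 * y                           # planted signature along one axis

acc_before = leave_one_sequence_out_accuracy(X, y, seq_ids)

# Erase the direction found by a probe fitted on all data, then re-probe.
w_all, _ = fit_probe(X, y)
X_erased = project_out(X, w_all)
acc_after = leave_one_sequence_out_accuracy(X_erased, y, seq_ids)

print(f"held-out accuracy before erasure: {acc_before:.2f}")
print(f"held-out accuracy after erasure:  {acc_after:.2f}")
```

In this toy setup, projecting out the probe direction makes the two class means coincide, so a freshly trained probe has little left to exploit on held-out sequences and its accuracy can drop below 0.5. That behaviour is a property of the synthetic example and the leave-one-out protocol; it is not a reproduction of the paper's reported gaps.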
