Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Anamika Paul Rupa, Anietie Andy

TL;DR
This paper introduces probe-geometry alignment (PGA), a surgical method that erases memorization signatures from language models' internal representations without measurable capability cost.
Contribution
It shows that the memorization signature is causally separable from recall and can be driven below chance with a simple intervention; projecting out the probe direction collapses the signature locally (+0.44 -> -0.19) while behavioural recall barely changes.
Findings
Memorization signatures are consistent across model scales: memorization-specific gaps of +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B.
PGA effectively reduces memorization signatures below chance.
PGA preserves model performance on zero-shot benchmarks.
Abstract
Recent attacks show that behavioural unlearning of large language models leaves internal traces recoverable by adversarial probes. We characterise where this retention lives and show it can be surgically removed without measurable capability cost. Our central protocol is a leave-one-out cross-sequence probe that tests whether a memorisation signature generalises across held-out sequences. The signature is real and consistent across scale: memorisation-specific gaps of +0.32, +0.19, +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B; on Pythia-70M, the random-initialisation control collapses to -0.04 at the deepest layer where the pretrained signature peaks. The probe direction is causally separable from recall -- projecting it out collapses the signature locally (+0.44 -> -0.19) while behavioural recall barely changes -- and a probe trained on naturally memorised content does not…
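The two operations the abstract names, a leave-one-out cross-sequence probe and projecting out the probe direction, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the probe type, the layer, or how activations are pooled, so the sketch assumes one pooled activation vector per sequence, a difference-of-means linear probe, and synthetic data. All function names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(rows):
    # Column-wise mean of a list of equal-length vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_probe(feats, labels):
    # Assumed probe: difference of class means, normalised to unit length,
    # with the decision threshold midway between the projected class means.
    pos = mean_vec([f for f, y in zip(feats, labels) if y == 1])
    neg = mean_vec([f for f, y in zip(feats, labels) if y == 0])
    diff = [p - q for p, q in zip(pos, neg)]
    norm = dot(diff, diff) ** 0.5
    direction = [x / norm for x in diff]
    threshold = 0.5 * (dot(pos, direction) + dot(neg, direction))
    return direction, threshold

def leave_one_out_accuracy(feats, labels):
    # Hold out each sequence in turn, fit the probe on the rest,
    # and score the prediction on the held-out sequence.
    correct = 0
    for i in range(len(feats)):
        train_f = feats[:i] + feats[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        direction, threshold = fit_probe(train_f, train_y)
        pred = 1 if dot(feats[i], direction) > threshold else 0
        correct += int(pred == labels[i])
    return correct / len(feats)

def project_out(feats, direction):
    # Remove the component along a unit direction v: h' = h - (h . v) v
    return [[x - dot(f, direction) * v for x, v in zip(f, direction)]
            for f in feats]

# Synthetic stand-in for per-sequence activations (assumption, not paper data):
# 20 "memorised" sequences shifted along axis 0, 20 controls, Gaussian noise.
random.seed(0)
dim, per_class = 8, 20
labels = [1] * per_class + [0] * per_class
feats = [[random.gauss(6.0 if (y == 1 and j == 0) else 0.0, 1.0)
          for j in range(dim)] for y in labels]

acc = leave_one_out_accuracy(feats, labels)

# Fit a direction on all sequences, then erase it from every activation.
direction, _ = fit_probe(feats, labels)
erased = project_out(feats, direction)
residual = max(abs(dot(f, direction)) for f in erased)

print("leave-one-out probe accuracy:", acc)
print("largest residual along erased direction:", residual)
```

The projection guarantees only that the erased activations have no component along the fitted direction. Whether a freshly trained probe still recovers the signal from other directions is a separate question, which is what the cross-sequence protocol in the abstract is designed to test.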
