A Lightweight Method to Disrupt Memorized Sequences in LLM
Parjanya Prajakta Prashant, Kaustubh Ponkshe, Babak Salimi

TL;DR
TokenSwap is a lightweight, post-hoc method that reduces memorization in large language models by swapping in token probabilities from a smaller model, preserving task performance while improving safety.
Contribution
Introduces TokenSwap, a practical post-hoc technique that mitigates memorization in large language models using small auxiliary models without retraining.
Findings
Up to 10× reduction in memorization
Negligible impact on task performance
Applicable to models like Pythia and Llama-3
Abstract
As language models scale, their performance improves dramatically across a wide range of tasks, but so does their tendency to memorize and regurgitate parts of their training data verbatim. This tradeoff poses serious legal, ethical, and safety concerns, especially in real-world deployments. Existing mitigation techniques, such as differential privacy or model unlearning, often require retraining or access to internal weights, making them impractical for most users. In this work, we introduce TokenSwap, a lightweight, post-hoc defense designed for realistic settings where the user can only access token-level outputs. Our key insight is that while large models are necessary for high task performance, small models (e.g., DistilGPT-2) are often sufficient to assign fluent, grammatically plausible probabilities to common function words, and, crucially, they memorize far less. By selectively…
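To make the swapping idea concrete, here is a minimal Python sketch of one way such a probability swap could work for a single decoding step. It is an illustration under stated assumptions, not the authors' implementation: the function name `swap_probabilities`, the toy distributions, and the rescaling rule (the swapped tokens keep the total probability mass the large model gave them) are all hypothetical choices, and both models are assumed to share a vocabulary.

```python
from typing import Dict, Set


def swap_probabilities(
    p_main: Dict[str, float],
    p_aux: Dict[str, float],
    swap_tokens: Set[str],
) -> Dict[str, float]:
    """Replace the large model's probabilities on `swap_tokens` with the
    small model's, rescaled so the swapped set keeps the same total mass.

    Hypothetical helper: the rescaling rule is an assumption made for this
    sketch so that the result is still a valid distribution.
    """
    main_mass = sum(p_main.get(t, 0.0) for t in swap_tokens)
    aux_mass = sum(p_aux.get(t, 0.0) for t in swap_tokens)
    if aux_mass == 0.0:
        # The small model gives no mass to the swap set; leave p_main as is.
        return dict(p_main)

    scale = main_mass / aux_mass
    mixed = dict(p_main)
    for token in swap_tokens:
        if token in p_main or token in p_aux:
            mixed[token] = p_aux.get(token, 0.0) * scale
    return mixed


# Toy next-token distributions (made-up numbers, not from the paper).
p_main = {"the": 0.70, "a": 0.05, "quantum": 0.25}  # large model
p_aux = {"the": 0.30, "a": 0.30, "quantum": 0.40}   # small model
function_words = {"the", "a"}                        # assumed swap set

mixed = swap_probabilities(p_main, p_aux, function_words)
print(mixed)
```

In this toy case the large model's sharp preference for "the" over "a" is replaced by the small model's relative preference between the two, while the probability of the content word "quantum" is left exactly as the large model assigned it. Only token-level probabilities are needed, which matches the access setting described in the abstract.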
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression
