Retracing the Past: LLMs Emit Training Data When They Get Lost
Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen, Charles Fleming, Ming Jin, and Ruoxi Jia

TL;DR
This paper presents Confusion-Inducing Attacks (CIA), a systematic framework for extracting memorized training data from large language models by inducing high-entropy states, revealing persistent privacy risks.
Contribution
Introduces CIA, a principled attack that extracts memorized data by maximizing model uncertainty; it outperforms existing heuristic attacks and requires no prior knowledge of the training data.
Findings
CIA effectively extracts memorized data from various LLMs.
High prediction entropy spikes precede data emission during divergence.
Mismatched SFT increases models' susceptibility to CIA.
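The entropy finding above can be made concrete with a small sketch. It is not the paper's attack or its code: it only illustrates how token-level prediction entropy might be computed from a model's logits and how a sustained run of high-entropy steps might be flagged. The function names, the `threshold` value, and the `min_run` length are all hypothetical choices for illustration, and the logits are toy values rather than real model outputs.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one step's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_sustained_spike(entropies, threshold, min_run):
    """Start index of the first run of at least `min_run` consecutive
    entropies strictly above `threshold`, or None if no such run exists.
    Both parameters are illustrative assumptions, not values from the paper."""
    run = 0
    for i, h in enumerate(entropies):
        if h > threshold:
            run += 1
            if run >= min_run:
                return i - min_run + 1
        else:
            run = 0
    return None


# Toy per-step logits over a 4-token vocabulary: confident, then confused.
steps = [
    [20.0, 0.0, 0.0, 0.0],  # peaked distribution, low entropy
    [0.0, 0.0, 0.0, 0.0],   # uniform distribution, maximal entropy
    [0.1, 0.0, 0.0, 0.1],
    [0.0, 0.2, 0.0, 0.0],
]
entropies = [token_entropy(s) for s in steps]
spike_start = first_sustained_spike(entropies, threshold=1.0, min_run=3)
```

In practice the per-step logits would come from the model under study, and a suitable threshold would depend on its vocabulary size, since the maximum possible entropy is the logarithm of the number of tokens.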
Abstract
The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often exhibit limited success and offer limited insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
