Retracing the Past: LLMs Emit Training Data When They Get Lost

Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen, Charles Fleming, Ming Jin, and Ruoxi Jia

arXiv:2511.05518·cs.CL·November 11, 2025


TL;DR

This paper presents Confusion-Inducing Attacks (CIA), a systematic framework for extracting memorized training data from large language models by inducing high-entropy states, revealing persistent privacy risks.

Contribution

Introduces CIA, a novel principled attack method that maximizes model uncertainty to extract memorized data, outperforming existing heuristics without prior data knowledge.

Findings

1. CIA effectively extracts memorized data from various LLMs.

2. High prediction entropy spikes precede data emission during divergence.

3. Mismatched SFT increases models' susceptibility to CIA.

Abstract

The memorization of training data in large language models (LLMs) poses significant privacy and copyright concerns. Existing data extraction methods, particularly heuristic-based divergence attacks, often have low success rates and offer little insight into the fundamental drivers of memorization leakage. This paper introduces Confusion-Inducing Attacks (CIA), a principled framework for extracting memorized data by systematically maximizing model uncertainty. We empirically demonstrate that the emission of memorized text during divergence is preceded by a sustained spike in token-level prediction entropy. CIA leverages this insight by optimizing input snippets to deliberately induce this consecutive high-entropy state. For aligned LLMs, we further propose Mismatched Supervised Fine-tuning (SFT) to simultaneously weaken their alignment and induce targeted confusion, thereby increasing…
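To make the central quantity concrete, here is a minimal sketch of token-level prediction entropy and of a check for a sustained run of high-entropy steps. It is an illustration only, not the authors' implementation: the function names `token_entropy` and `longest_high_entropy_run`, the use of raw per-step logits as plain lists, and the threshold value are all assumptions made for this example.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one step's logits.

    `logits` is a plain list of floats, one per vocabulary entry (assumed input format).
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def longest_high_entropy_run(entropies, threshold):
    """Length of the longest consecutive run of steps whose entropy exceeds `threshold`.

    The threshold is a hypothetical tuning knob, not a value taken from the paper.
    """
    best = run = 0
    for h in entropies:
        run = run + 1 if h > threshold else 0
        best = max(best, run)
    return best


# Toy example: a near-uniform step has high entropy, a peaked step has low entropy.
uniform_step = [0.0, 0.0, 0.0, 0.0]
peaked_step = [10.0, 0.0, 0.0, 0.0]
steps = [peaked_step, uniform_step, uniform_step, uniform_step, peaked_step]
entropies = [token_entropy(s) for s in steps]
run_length = longest_high_entropy_run(entropies, threshold=1.0)
```

In this toy sequence the three near-uniform steps form the sustained high-entropy stretch. An attack in the spirit of CIA would search for inputs that lengthen such stretches; the sketch above only measures them.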


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection