Low-Perplexity LLM-Generated Sequences and Where To Find Them

Arthur Wuhrmann; Anastasiia Kucherenko; Andrei Kucharavy

arXiv:2507.01844·cs.CL·July 3, 2025

TL;DR

This paper presents a systematic method to analyze low-perplexity sequences generated by LLMs, revealing insights into how training data influences model outputs and identifying the extent of data memorization.

Contribution

Introduces a pipeline for extracting and tracing low-perplexity sequences in LLM outputs to their training data sources, enhancing transparency and understanding of model behavior.

Findings

1. Many low-perplexity sequences cannot be traced to training data.
2. Some sequences are verbatim recalls from training sources.
3. The approach helps quantify the influence of training data on generated text.
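
To make the tracing step concrete, here is a toy sketch of matching a generated span against a corpus and counting verbatim occurrences per document. It is an illustration only, not the paper's pipeline: the names `trace_sequence` and `count_overlapping`, the dictionary-based corpus, and the character-level substring matching are all assumptions made for the example. A real training corpus would need an index rather than a linear scan.

```python
from collections import Counter


def count_overlapping(text, pattern):
    """Count occurrences of `pattern` in `text`, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def trace_sequence(sequence, corpus):
    """Map each document id to the number of verbatim matches of `sequence`.

    `corpus` is a hypothetical {doc_id: text} dictionary. Documents with
    no match are left out, so an empty result means "not traceable".
    """
    hits = Counter()
    for doc_id, text in corpus.items():
        n = count_overlapping(text, sequence)
        if n:
            hits[doc_id] = n
    return hits


# Toy corpus (invented for illustration).
corpus = {
    "doc_a": "the quick brown fox. the quick brown fox again.",
    "doc_b": "a slow brown fox",
    "doc_c": "nothing relevant here",
}
traced = trace_sequence("quick brown fox", corpus)
untraced = trace_sequence("purple elephant", corpus)
```

Here `traced` records how many times the span occurs in each matching document, while `untraced` is empty, which corresponds to a span that cannot be mapped to the corpus.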

Abstract

As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate their training data, we introduce a systematic approach centered on analyzing low-perplexity sequences, that is, high-probability text spans generated by the model. Our pipeline reliably extracts such long sequences across diverse topics while avoiding degeneration, then traces them back to their sources in the training data. Surprisingly, we find that a substantial portion of these low-perplexity spans cannot be mapped to the corpus. For those that do match, we quantify the distribution of occurrences across source documents, highlighting the scope and nature of verbatim recall and paving the way toward a better understanding of how LLMs' training data impacts their…
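
As a rough illustration of what "low-perplexity" means here: the perplexity of a token span is the exponential of the negative mean log-probability of its tokens, so a span the model assigns high probability to has perplexity close to 1. The sketch below computes this from a list of per-token log-probabilities and flags fixed-length windows that fall under a threshold. The function names, the window length, and the threshold value are assumptions chosen for the example; the abstract does not state the paper's actual settings.

```python
import math


def perplexity(logprobs):
    """Perplexity of a span from natural-log token probabilities."""
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))


def low_perplexity_windows(logprobs, window, threshold):
    """Return (start_index, perplexity) for every window of `window`
    consecutive tokens whose perplexity is at most `threshold`."""
    found = []
    for start in range(len(logprobs) - window + 1):
        ppl = perplexity(logprobs[start:start + window])
        if ppl <= threshold:
            found.append((start, ppl))
    return found


# Hypothetical per-token probabilities for a generated text.
token_probs = [0.9, 0.95, 0.99, 0.2, 0.9, 0.9]
token_logprobs = [math.log(p) for p in token_probs]

# Window length 3 and threshold 1.1 are illustrative values only.
windows = low_perplexity_windows(token_logprobs, window=3, threshold=1.1)
```

With these toy numbers, only the first window (the three high-probability tokens before the 0.2 token) stays under the threshold; every window that includes the low-probability token is rejected.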

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Text Readability and Simplification