Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Minseo Kwak; Jaehyung Kim

arXiv:2601.19936 · cs.LG · January 29, 2026

Open Access

TL;DR

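Gap-K% is a method for detecting whether a text was part of a large language model's pretraining data. It scores text by the log probability gap between the model's top-1 predicted token and the actual target token, and reports better detection performance than prior likelihood-based methods on the WikiMIA and MIMIR benchmarks.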

Contribution

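The paper proposes Gap-K%, a pretraining data detection technique grounded in the optimization dynamics of LLM pretraining. It combines two signals that likelihood-based methods tend to overlook: the divergence of the target token from the model's top-1 prediction, and the local correlation between adjacent tokens, captured with a sliding window.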

Findings

01. Achieves state-of-the-art results on the WikiMIA and MIMIR benchmarks.

02. Outperforms prior methods across different model sizes and input lengths.

03. Effectively captures local token correlations and divergence signals.

Abstract

The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the divergence from the model's top-1 prediction and local correlation between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model's top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local…
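The abstract is cut off before the full scoring rule is stated, so the following is only a minimal sketch of the idea it describes, not the authors' implementation. It computes the per-token gap between the top-1 log probability and the target-token log probability, then smooths the gaps with a sliding window. The final aggregation (averaging the largest K% of windowed gaps, by analogy with Min-K%-style detectors), the function names, and the defaults `k=0.2` and `window=3` are all assumptions made for illustration.

```python
import numpy as np

def token_gaps(log_probs, target_ids):
    """Per-token gap: log p(top-1 token) - log p(target token).

    log_probs: (T, V) array of log-softmax outputs, one row per position.
    target_ids: (T,) array of the actual next-token ids.
    The gap is >= 0 and equals 0 when the target is the top-1 prediction.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    target_ids = np.asarray(target_ids)
    top1 = log_probs.max(axis=1)
    target = log_probs[np.arange(len(target_ids)), target_ids]
    return top1 - target

def sliding_mean(values, window):
    """Mean over each run of `window` adjacent positions (stride 1)."""
    window = max(1, min(window, len(values)))
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

def gap_k_score(log_probs, target_ids, k=0.2, window=3):
    """Hypothetical Gap-K%-style score; higher suggests 'seen in pretraining'.

    ASSUMPTION: aggregate by averaging the largest k fraction of the
    windowed gaps and negating. The source does not state this rule.
    """
    gaps = sliding_mean(token_gaps(log_probs, target_ids), window)
    n = max(1, int(np.ceil(k * len(gaps))))
    worst = np.sort(gaps)[-n:]  # the n largest windowed gaps
    return -float(worst.mean())
```

In practice, `log_probs` would come from a causal language model's log-softmax over the vocabulary at each position, and a threshold on the score would separate likely members from non-members. Under this sketch, a text whose tokens all match the model's top-1 predictions scores 0, the maximum possible value.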

Taxonomy

Topics: Topic Modeling · Authorship Attribution and Profiling · Computational and Text Analysis Methods