PDR: A Plug-and-Play Positional Decay Framework for LLM Pre-training Data Detection

Jinhan Liu; Yibo Yang; Ruiying Lu; Piotr Piekos; Yimeng Chen; Peng Wang; Dandan Guo

arXiv:2601.06827·cs.CL·January 13, 2026

Open Access

TL;DR

This paper introduces PDR, a plug-and-play framework that reweights token scores based on positional decay to improve detection of pre-training data in LLMs, addressing privacy and copyright concerns.

Contribution

The paper proposes PDR, a novel, training-free reweighting method that leverages positional decay to enhance likelihood-based data detection in LLMs.

Findings

01 PDR improves detection accuracy across multiple benchmarks.

02 PDR effectively amplifies early token signals where memorization is strongest.

03 PDR enhances existing likelihood-based methods without additional training.

Abstract

Detecting pre-training data in Large Language Models (LLMs) is crucial for auditing data privacy and copyright compliance, yet it remains challenging in black-box, zero-shot settings where computational resources and training data are scarce. While existing likelihood-based methods have shown promise, they typically aggregate token-level scores using uniform weights, thereby neglecting the inherent information-theoretic dynamics of autoregressive generation. In this paper, we hypothesize and empirically validate that memorization signals are heavily skewed towards the high-entropy initial tokens, where model uncertainty is highest, and decay as context accumulates. To leverage this linguistic property, we introduce Positional Decay Reweighting (PDR), a training-free and plug-and-play framework. PDR explicitly reweights token-level scores to amplify distinct signals from early positions…
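
The reweighting idea in the abstract can be illustrated with a minimal sketch. The abstract is truncated and does not state which decay function PDR uses, so the exponential form, the `alpha` parameter, and all function names below are assumptions made for illustration, not the authors' implementation. The token scores stand in for whatever a likelihood-based detector produces per token (for example, log-probabilities).

```python
import math
from typing import List, Sequence


def positional_weights(n: int, alpha: float = 0.05) -> List[float]:
    """Return n weights that decay with position and sum to 1.

    Assumption: exponential decay exp(-alpha * i). The source does not
    specify the decay function; this is one plausible choice.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    raw = [math.exp(-alpha * i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def uniform_score(token_scores: Sequence[float]) -> float:
    """Baseline aggregation: plain mean of the token-level scores."""
    if not token_scores:
        raise ValueError("token_scores must be non-empty")
    return sum(token_scores) / len(token_scores)


def reweighted_score(token_scores: Sequence[float], alpha: float = 0.05) -> float:
    """Position-weighted aggregation that emphasizes early tokens."""
    weights = positional_weights(len(token_scores), alpha)
    return sum(w * s for w, s in zip(weights, token_scores))


# Hypothetical per-token log-probabilities, higher (less negative) early on.
example_scores = [-0.5, -1.0, -3.0, -4.0]
baseline = uniform_score(example_scores)
reweighted = reweighted_score(example_scores, alpha=0.5)
```

With `alpha = 0` the weights are uniform and the two aggregations coincide, which is what makes this kind of reweighting a drop-in change to an existing score. Larger `alpha` shifts more of the weight onto the first tokens.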

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)