As If We've Met Before: LLMs Exhibit Certainty in Recognizing Seen Files

Haodong Li; Jingqi Zhang; Xiao Cheng; Peihua Mai; Haoyu Wang; Yan Pang

arXiv:2511.15192 · cs.AI · November 21, 2025

Open Access

TL;DR

COPYCHECK is a novel framework that uses uncertainty signals from LLMs to accurately detect whether specific content was part of their training data, addressing limitations of previous methods.

Contribution

It introduces a new approach leveraging LLM overconfidence and uncertainty patterns for copyright detection, with strategies to improve robustness and threshold independence.

Findings

01. Achieves over 90% balanced accuracy on LLaMA 7b and LLaMA2 7b.

02. Outperforms the state-of-the-art baseline by over 90% in relative improvement.

03. Generalizes well across different LLM architectures.
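The findings are reported as balanced accuracy, which is the mean of the recall on each class. A minimal sketch of that metric follows, assuming label 1 means "seen" and label 0 means "unseen"; the function name and the toy labels are illustrative and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive (1, "seen") and negative (0, "unseen") classes."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    # True positives and true negatives, counted pairwise.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy labels: 4 seen files, 2 unseen files.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)  # 0.5 * (3/4 + 1/2) = 0.625
```

Unlike plain accuracy (4 of 6 correct here), balanced accuracy weights both classes equally, so it is not inflated when seen and unseen samples are unevenly represented.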

Abstract

The remarkable language ability of Large Language Models (LLMs) stems from extensive training on vast datasets, often including copyrighted material, which raises serious concerns about unauthorized use. While Membership Inference Attacks (MIAs) offer potential solutions for detecting such violations, existing approaches face critical limitations and challenges due to LLMs' inherent overconfidence, limited access to ground truth training data, and reliance on empirically determined thresholds. We present COPYCHECK, a novel framework that leverages uncertainty signals to detect whether copyrighted content was used in LLM training sets. Our method turns LLM overconfidence from a limitation into an asset by capturing uncertainty patterns that reliably distinguish between "seen" (training data) and "unseen" (non-training data) content. COPYCHECK further implements a two-fold strategy:…
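The abstract does not say which uncertainty signals COPYCHECK computes. As a purely illustrative sketch, one common token-level signal is the Shannon entropy of the model's next-token distribution, averaged over a text. The function names and the toy distributions below are assumptions, not the paper's method.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions):
    """Average token entropy over a sequence of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


# Hypothetical distributions over a 4-token vocabulary (not real model output).
confident_text = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
uncertain_text = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

low = mean_entropy(confident_text)
high = mean_entropy(uncertain_text)
```

Under the assumption that a model is more certain on text it has seen, lower average entropy would point toward "seen" content. Any real detector would need the actual features and decision rule from the full paper.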

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)