A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models

Yinpeng Cai; Lexin Li; Linjun Zhang

arXiv:2501.02441·stat.ML·October 7, 2025

Open Access

TL;DR

This paper introduces a statistical hypothesis testing framework that uses embedded watermarks to detect data misappropriation in large language models, addressing privacy and copyright concerns.

Contribution

It proposes embedding watermarks in training data and formulates misappropriation detection as a hypothesis testing problem, with proven optimality and empirical validation.

Findings

01. Effective detection of data misappropriation in LLMs
02. The proposed tests control error rates explicitly
03. Empirical results demonstrate high detection accuracy

Abstract

Large Language Models (LLMs) have rapidly gained enormous popularity in recent years. However, the training of LLMs has raised significant privacy and legal concerns, particularly regarding the distillation and inclusion of copyrighted materials in their training data without proper attribution or licensing, an issue that falls under the broader concern of data misappropriation. In this article, we focus on a specific problem of data misappropriation detection, namely, determining whether a given LLM has incorporated the data generated by another LLM. We propose embedding watermarks into the copyrighted training data and formulating the detection of data misappropriation as a hypothesis testing problem. We develop a general statistical testing framework, construct test statistics, determine optimal rejection thresholds, and explicitly control type I and type II errors. Furthermore, we…
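To make the hypothesis-testing formulation concrete, here is a minimal Python sketch of a one-sided z-test on a keyed "green-list" watermark. This is a generic detection scheme swapped in for illustration, not the test statistic or threshold constructed in the paper. The function names (`is_green`, `green_count`, `z_test`, `approx_power`), the hashing rule, and the parameters `gamma` (green-list fraction), `alpha` (type I error level), and `p1` (green-token probability under the alternative) are all assumptions introduced here.

```python
import hashlib
import math
from statistics import NormalDist


def is_green(prev_token: str, token: str, key: str, gamma: float) -> bool:
    """Hypothetical keyed rule: hash (key, previous token, token) to [0, 1)."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def green_count(tokens, key: str, gamma: float) -> int:
    """Count green tokens over consecutive (previous, current) token pairs."""
    return sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))


def z_test(count: int, n: int, gamma: float, alpha: float):
    """One-sided z-test of H0: green probability equals gamma (no watermark).

    Returns (z statistic, rejection threshold, reject H0?).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    z = (count - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    threshold = NormalDist().inv_cdf(1 - alpha)  # controls type I error at alpha
    return z, threshold, z > threshold


def approx_power(n: int, gamma: float, p1: float, alpha: float) -> float:
    """Normal-approximation power when the true green probability is p1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    critical_count = gamma * n + z_alpha * math.sqrt(n * gamma * (1 - gamma))
    std_alt = math.sqrt(n * p1 * (1 - p1))
    return 1 - NormalDist().cdf((critical_count - n * p1) / std_alt)
```

In this sketch the type I error is fixed by the choice of `alpha`, and the type II error is `1 - approx_power(...)`, which shrinks as the number of scored tokens `n` grows or as `p1` moves away from `gamma`. Both quantities rely on a normal approximation to the binomial count, so they are only indicative for small `n`.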

Taxonomy

Topics: Data Quality and Management · Topic Modeling
