Inner-Probe: Discovering Copyright-related Data Generation in LLM Architecture

Qichao Ma; Rui-Jie Zhu; Peiye Liu; Renye Yan; Fahong Zhang; Ling Liang; Meng Li; Zhaofei Yu; Zongwei Wang; Yimao Cai; Tiejun Huang

arXiv:2410.04454 · cs.CL · January 5, 2026

TL;DR

Inner-Probe is a lightweight framework that effectively identifies copyrighted data influence in LLM outputs by analyzing multi-head attention results, improving detection accuracy and efficiency over existing methods.

Contribution

The paper introduces Inner-Probe, a novel approach using MHA analysis and supervised learning to determine copyrighted data influence in LLMs, addressing limitations of previous prompt-based methods.
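The summary above names two ingredients, MHA analysis and supervised learning, without giving architectural details. The block below is a toy sketch of that general idea only, not the authors' implementation. Its assumptions: random matrices stand in for a real LLM's attention weights, the inputs are synthetic, each head's outputs are mean-pooled into a feature vector, and a plain logistic-regression probe is swapped in for whatever classifier the paper actually uses. All function names (`head_output`, `mha_features`, `train_probe`, `demo`) are hypothetical.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def project(x, W):
    """Multiply a vector x (length d) by a d-by-k matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def head_output(X, Wq, Wk, Wv):
    """Scaled dot-product attention for one head: one output vector per token."""
    Q = [project(x, Wq) for x in X]
    K = [project(x, Wk) for x in X]
    V = [project(x, Wv) for x in X]
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def mha_features(X, heads):
    """Mean-pool each head's outputs over tokens, then concatenate across heads."""
    feats = []
    for Wq, Wk, Wv in heads:
        out = head_output(X, Wq, Wk, Wv)
        feats.extend(sum(col) / len(out) for col in zip(*out))
    return feats

def sigmoid(z):
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(features, labels, epochs=30, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient ascent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = y - p  # gradient of the log-likelihood w.r.t. the logit
            w = [wi + lr * g * fi for wi, fi in zip(w, f)]
            b += lr * g
    return w, b

def predict(w, b, f):
    """Return 1 if the probe assigns probability >= 0.5, else 0."""
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5 else 0

def demo(seed=0):
    """Synthetic stand-in: two 'sub-datasets' whose token vectors differ in mean."""
    rng = random.Random(seed)
    d, dh, n_heads, n_tokens = 4, 3, 2, 5  # assumed toy sizes
    def rand_matrix():
        return [[rng.gauss(0, 1) for _ in range(dh)] for _ in range(d)]
    heads = [(rand_matrix(), rand_matrix(), rand_matrix()) for _ in range(n_heads)]
    features, labels = [], []
    for i in range(40):
        y = i % 2
        mean = 1.0 if y else -1.0
        X = [[rng.gauss(mean, 0.05) for _ in range(d)] for _ in range(n_tokens)]
        features.append(mha_features(X, heads))
        labels.append(y)
    w, b = train_probe(features, labels)
    acc = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
    return features, labels, acc
```

Calling `demo()` returns the pooled features, the labels, and the probe's accuracy on its own synthetic training set. That accuracy says nothing about the paper's reported results; the sketch only shows how attention outputs, rather than generated text, can serve as classifier input.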

Findings

1. 3x improved efficiency in sub-dataset analysis

2. 15.04% to 58.7% higher accuracy over baselines

3. 0.104 increase in AUC for non-copyrighted data filtering

Abstract

Large Language Models (LLMs) utilize extensive knowledge databases and show powerful text generation ability. However, their reliance on high-quality copyrighted datasets raises concerns about copyright infringement in generated texts. Current research often employs prompt engineering or semantic classifiers to identify copyrighted content, but these approaches have two significant limitations: (1) they struggle to identify which specific sub-dataset (e.g., works from particular authors) influences an LLM's output, and (2) they treat the entire training database as copyrighted, overlooking the non-copyrighted training data it includes. We propose Inner-Probe, a lightweight framework designed to evaluate the influence of copyrighted sub-datasets on LLM-generated texts. Unlike traditional methods relying solely on text, we discover that the results of multi-head attention (MHA) during…

Taxonomy

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Focus