Digger: Detecting Copyright Content Mis-usage in Large Language Model Training
Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu, Guoai Xu, Guosheng Xu, Haoyu Wang

TL;DR
This paper presents a framework to detect and evaluate the presence of copyrighted content in large language model training datasets, aiming to promote ethical data use and transparency.
Contribution
It introduces a novel framework for identifying copyrighted material in LLM datasets and estimating the confidence of each detection, validated through simulated experiments.
Findings
Effective detection of copyrighted content in datasets
Identification of literary quotes in training data
Implications for ethical data management in LLM development
Abstract
Pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of Large Language Models (LLMs) across numerous applications. However, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and potential misuse. This is particularly relevant when copyrighted material, still under legal protection, is used inappropriately, either intentionally or unintentionally, infringing on the rights of the authors. In this paper, we introduce a detailed framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of LLMs. This framework also provides a confidence estimation for the likelihood of each content sample's inclusion. To validate our approach, we conduct a series of simulated experiments, the results of which affirm the framework's…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Law
