Copyright Violations and Large Language Models
Antonia Karamolegkou, Jiaang Li, Li Zhou, Anders Søgaard

TL;DR
This paper investigates the extent to which large language models memorize and potentially redistribute copyrighted texts, highlighting copyright concerns and the need for further regulation in NLP development.
Contribution
It provides experimental analysis of memorization and redistribution risks in language models, emphasizing copyright implications and the necessity for careful regulation.
Findings
Language models can memorize and potentially reproduce copyrighted texts.
The extent of redistribution varies across models and datasets.
The results highlight the importance of addressing copyright issues in NLP research.
Abstract
Language models may memorize more than just facts, including entire chunks of texts seen during training. Fair use exemptions to copyright laws typically allow for limited use of copyrighted material without permission from the copyright holder, but typically for extraction of information from copyrighted materials, rather than verbatim reproduction. This work explores the issue of copyright violations and large language models through the lens of verbatim memorization, focusing on possible redistribution of copyrighted text. We present experiments with a range of language models over a collection of popular books and coding problems, providing a conservative characterization of the extent to which language models can redistribute these materials. Overall, this research highlights the need for further examination and the potential impact on future developments in natural language…
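To make "verbatim memorization" concrete, one simple check is to find the longest contiguous run of words that a model's output shares with a reference text. The sketch below is illustrative only and is not taken from the paper: the function name `longest_verbatim_run` is hypothetical, whitespace tokenization is an assumption, and the paper's actual evaluation protocol may differ.

```python
from difflib import SequenceMatcher
from typing import List


def longest_verbatim_run(reference: str, generated: str) -> List[str]:
    """Return the longest contiguous word sequence shared by both texts.

    Hypothetical helper: tokens are split on whitespace (an assumption),
    and matching is exact and case-sensitive.
    """
    ref_tokens = reference.split()
    gen_tokens = generated.split()
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent tokens in long sequences, so common words still match.
    matcher = SequenceMatcher(None, ref_tokens, gen_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_tokens), 0, len(gen_tokens))
    return ref_tokens[match.a:match.a + match.size]


if __name__ == "__main__":
    reference = "it was the best of times it was the worst of times"
    generated = "the model wrote it was the best of times and then stopped"
    run = longest_verbatim_run(reference, generated)
    print(len(run), " ".join(run))
```

The length of the returned run, in words, gives a rough lower bound on how much of the reference is reproduced word for word. Paraphrases and near-copies are not counted, which matches the conservative framing in the abstract.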
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
