Copyright Violations and Large Language Models

Antonia Karamolegkou; Jiaang Li; Li Zhou; Anders Søgaard

arXiv:2310.13771 · cs.CL · October 24, 2023 · 6 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper investigates the extent to which large language models memorize and potentially redistribute copyrighted texts, highlighting copyright concerns and the need for further regulation in NLP development.

Contribution

It provides experimental analysis of memorization and redistribution risks in language models, emphasizing copyright implications and the necessity for careful regulation.

Findings

01

Language models can memorize and potentially reproduce copyrighted texts.

02

The extent of redistribution varies across models and datasets.

03

The paper highlights the importance of addressing copyright issues in NLP research.

Abstract

Language models may memorize more than just facts, including entire chunks of texts seen during training. Fair use exemptions to copyright laws typically allow for limited use of copyrighted material without permission from the copyright holder, but typically for extraction of information from copyrighted materials, rather than verbatim reproduction. This work explores the issue of copyright violations and large language models through the lens of verbatim memorization, focusing on possible redistribution of copyrighted text. We present experiments with a range of language models over a collection of popular books and coding problems, providing a conservative characterization of the extent to which language models can redistribute these materials. Overall, this research highlights the need for further examination and the potential impact on future developments in natural language…
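
To make "verbatim memorization" concrete, the sketch below measures the longest run of consecutive words that a model's output shares with a reference passage. It is an illustration only: the function name `longest_common_run`, the whitespace tokenization, and the choice of a contiguous-run metric are all assumptions made here, not details taken from the paper.

```python
def longest_common_run(generated: str, reference: str) -> int:
    """Return the length, in words, of the longest contiguous word
    sequence that appears in both texts (hypothetical helper)."""
    gen = generated.split()   # assumption: plain whitespace tokenization
    ref = reference.split()
    best = 0
    # prev[j] = length of the common run ending at the previous generated
    # word and at reference word j (1-indexed).
    prev = [0] * (len(ref) + 1)
    for g in gen:
        curr = [0] * (len(ref) + 1)
        for j, r in enumerate(ref, start=1):
            if g == r:
                # Extend the run that ended one word earlier in both texts.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    model_output = "it was the best of times indeed"
    source_text = "truly it was the best of times it was the worst"
    print(longest_common_run(model_output, source_text))  # prints 6
```

A longer shared run suggests more verbatim reproduction. A contiguous-run measure of this kind is conservative in the sense that paraphrases and reordered passages do not count toward it.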

Code & Models

Repositories

coastalcph/copyrightllms
PyTorch · Official

Datasets

avduarte333/arXivTection
dataset · 761 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling