The Files are in the Computer: On Copyright, Memorization, and Generative AI
A. Feder Cooper, James Grimmelmann

TL;DR
This paper clarifies the concept of memorization in generative AI models, providing a precise definition and analyzing its legal and technical implications, distinguishing it from related phenomena like extraction and regurgitation.
Contribution
It offers a clear, technical definition of memorization in AI models and explores its legal consequences, addressing ambiguities in current debates.
Findings
A model has memorized a piece of training data when a near-exact copy of a substantial portion of it can be reconstructed from the model.
Not all learning in models constitutes memorization.
Memorization arises during training; it is not caused by user actions.
Abstract
The New York Times's copyright lawsuit against OpenAI and Microsoft alleges OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions, providing a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (user intentionally causes a model to generate a near-exact copy), from "regurgitation" (model…
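As a rough illustration of the four-part definition in the abstract, the sketch below checks whether a generated output contains a near-exact copy of a substantial portion of one piece of training data. It is not the authors' method: the function names (`copied_fraction`, `looks_memorized`), the word-level comparison, and the thresholds (`min_run`, `substantial`) are assumptions chosen for illustration, and the paper as summarized here does not give numeric cutoffs. Condition (1), that the copy can be reconstructed from the model, is taken as given by supplying an output the model actually produced.

```python
from difflib import SequenceMatcher


def copied_fraction(training_text: str, generated_text: str, min_run: int = 5) -> float:
    """Fraction of the training text's words that reappear, in order, in the output.

    Only matching runs of at least `min_run` consecutive words are counted, so
    isolated common words do not inflate the score. Both the word-level
    comparison and `min_run` are illustrative assumptions.
    """
    train_words = training_text.split()
    generated_words = generated_text.split()
    if not train_words:
        return 0.0
    matcher = SequenceMatcher(None, train_words, generated_words, autojunk=False)
    copied = sum(
        block.size for block in matcher.get_matching_blocks() if block.size >= min_run
    )
    return copied / len(train_words)


def looks_memorized(
    training_text: str,
    generated_text: str,
    substantial: float = 0.5,  # hypothetical threshold for "substantial portion"
    min_run: int = 5,
) -> bool:
    """True if the output reproduces at least `substantial` of the training text."""
    return copied_fraction(training_text, generated_text, min_run) >= substantial


if __name__ == "__main__":
    article = "the quick brown fox jumps over the lazy dog near the quiet river bank"
    output = "Sure: the quick brown fox jumps over the lazy dog near a loud city street"
    print(copied_fraction(article, output))
    print(looks_memorized(article, output))
```

What counts as "near-exact" and "substantial" is ultimately a legal judgment rather than a fixed number, so any threshold in a check like this is a stand-in for that judgment.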
Taxonomy
Topics: Law, AI, and Intellectual Property
