Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit

Joshua Freeman; Chloe Rippe; Edoardo Debenedetti; Maksym Andriushchenko

arXiv:2412.06370 · cs.LG · December 10, 2024

TL;DR

This study measures how readily frontier large language models reproduce memorized text, and the copyright infringement risk that follows, in the context of the 2023 New York Times v. OpenAI lawsuit. It highlights the roles of model size and of training practices such as refusal training.

Contribution

It provides a comparative analysis of memorization in OpenAI's models versus other LLMs, and discusses legal and practical implications of model memorization capabilities.

Findings

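01. OpenAI models use refusal training and output filters to prevent verbatim output of memorized articles.

02. Larger models (>100B parameters) show increased memorization capacity.

03. When a basic prompt template is used to bypass refusal training, OpenAI models are less prone to memorization elicitation than models from Meta, Mistral, and Anthropic.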

Abstract

Copyright infringement in frontier LLMs has received much attention recently due to the New York Times v. OpenAI lawsuit, filed in December 2023. The New York Times claims that GPT-4 has infringed its copyrights by reproducing articles for use in LLM training and by memorizing the inputs, thereby publicly displaying them in LLM outputs. Our work aims to measure the propensity of OpenAI's LLMs to exhibit verbatim memorization in their outputs relative to other LLMs, specifically focusing on news articles. We discover that both GPT and Claude models use refusal training and output filters to prevent verbatim output of the memorized articles. We apply a basic prompt template to bypass the refusal training and show that OpenAI models are currently less prone to memorization elicitation than models from Meta, Mistral, and Anthropic. We find that as models increase in size, especially beyond…
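As a rough illustration of what measuring verbatim memorization can look like, the sketch below scores a model output by the longest run of consecutive words it shares with a source article. This is an assumption made for illustration, not the metric the paper necessarily uses; the function names `longest_verbatim_run` and `verbatim_fraction` are hypothetical, and the example strings are invented.

```python
from difflib import SequenceMatcher


def _words(text: str) -> list[str]:
    # Case-insensitive whitespace tokenization (an illustrative choice).
    return text.casefold().split()


def longest_verbatim_run(article: str, output: str) -> int:
    """Length, in words, of the longest contiguous word sequence
    that appears in both the article and the model output."""
    a, b = _words(article), _words(output)
    if not a or not b:
        return 0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


def verbatim_fraction(article: str, output: str) -> float:
    """Longest shared run as a fraction of the article's word count."""
    n = len(_words(article))
    return longest_verbatim_run(article, output) / n if n else 0.0


if __name__ == "__main__":
    article = "the quick brown fox jumps over the lazy dog"
    output = "He said the quick brown fox jumps high"
    print(longest_verbatim_run(article, output))
    print(round(verbatim_fraction(article, output), 3))
```

A word-level contiguous match is only one possible choice; character-level matching or n-gram overlap would give different scores on the same pair of texts.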

Taxonomy

Topics: Legal Systems and Judicial Processes · Intellectual Property Law · Business Law and Ethics

Methods: Attention Is All You Need · Dropout · Attention Dropout · Position-Wise Feed-Forward Layer · Softmax · Cosine Annealing · Byte Pair Encoding · Linear Layer · Linear Warmup With Cosine Annealing