LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Felix B Mueller, Rebekka Görge, Anna K Bernzen, Janna C Pirk, Maximilian Poretschkin

TL;DR
This paper systematically analyzes how large language models reproduce copyrighted content, evaluating their compliance with European law and comparing different models' tendencies to infringe or refuse to produce protected text.
Contribution
It introduces a novel legal and technical framework for assessing copyright infringement in LLMs, including a fuzzy matching algorithm and end-user scenario evaluation.
Findings
Significant variation in copyright compliance among models.
Models like Alpaca and GPT-4 show fewer violations.
Refusal and hallucination behaviors differ across models.
Abstract
Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act, and on a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. The specificity of countermeasures against copyright infringement is analyzed by comparing model behavior on copyrighted and public domain data.…
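The abstract names a 160-character threshold and a fuzzy text matching step but does not spell out the algorithm. The following is only a minimal sketch of how such a check might look, assuming Python's standard `difflib.SequenceMatcher` as a stand-in for the paper's matcher. The function names and the `max_gap` and `min_block` parameters are hypothetical and chosen for illustration.

```python
from difflib import SequenceMatcher

# Threshold named in the abstract (characters of reproduced text).
THRESHOLD_CHARS = 160


def longest_fuzzy_match(generated: str, reference: str,
                        max_gap: int = 5, min_block: int = 8) -> int:
    """Length, in characters of `generated`, of the longest near-verbatim run.

    Exact matching blocks of at least `min_block` characters are chained
    into one run when the gap between consecutive blocks is at most
    `max_gap` characters in both texts. This is an illustrative stand-in,
    not the matching procedure used in the paper.
    """
    matcher = SequenceMatcher(None, generated, reference, autojunk=False)
    best = 0
    run_start = None            # start of the current run in `generated`
    prev_a_end = prev_b_end = 0
    for a, b, size in matcher.get_matching_blocks():
        if size < min_block:
            continue            # skips tiny blocks and the final sentinel
        close_enough = (
            run_start is not None
            and a - prev_a_end <= max_gap
            and b - prev_b_end <= max_gap
        )
        if not close_enough:
            run_start = a       # start a new run at this block
        prev_a_end, prev_b_end = a + size, b + size
        best = max(best, prev_a_end - run_start)
    return best


def exceeds_threshold(generated: str, reference: str,
                      threshold: int = THRESHOLD_CHARS) -> bool:
    """True if the longest near-verbatim run reaches the threshold."""
    return longest_fuzzy_match(generated, reference) >= threshold
```

Under these assumptions, a model output that copies a long passage with a few altered characters would still be flagged, while short or unrelated outputs would not. The gap and block-size limits trade off tolerance to small edits against false positives, and would need tuning on real data.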
Taxonomy
Topics: Digital Rights Management and Security
Methods: Attention Is All You Need · Dropout · Dense Connections · Softmax · Layer Normalization · Cosine Annealing · Discriminative Fine-Tuning · Attention Dropout
