Copyright-Protected Language Generation via Adaptive Model Fusion
Javier Abad, Konstantin Donhauser, Francesco Pinto, Fanny Yang

TL;DR
This paper introduces CP-Fuse, an inference-time model fusion technique that adaptively combines models trained on disjoint sets of copyrighted material to reduce reproduction of protected content while maintaining generation quality.
Contribution
The paper presents a novel adaptive model fusion method for inference-time copyright protection, improving over existing post-processing strategies in efficiency and effectiveness.
Findings
Significantly reduces reproduction of copyrighted material.
Maintains high quality in text and code generation.
Robust against data extraction techniques.
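One simple way to quantify "reproduction of copyrighted material" is the longest run of consecutive tokens that a generation shares with a protected reference text. The sketch below is only illustrative: this page does not say which memorization metrics the paper uses, so the function name `longest_common_run` and the whitespace tokenization are assumptions made for the example.

```python
def longest_common_run(generated: str, reference: str) -> int:
    """Length (in whitespace tokens) of the longest contiguous token run
    that appears in both texts. Hypothetical metric, not the paper's own."""
    gen, ref = generated.split(), reference.split()
    best = 0
    # prev[j] = length of the common run ending at the previous generated
    # token and at reference token j (1-indexed).
    prev = [0] * (len(ref) + 1)
    for g in gen:
        curr = [0] * (len(ref) + 1)
        for j, r in enumerate(ref, start=1):
            if g == r:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

A lower value for a fused model than for a single model on the same prompts would be consistent with the first finding, under the assumptions stated above.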
Abstract
The risk of language models reproducing copyrighted material from their training data has led to the development of various protective measures. Among these, inference-time strategies that impose constraints via post-processing have shown promise in addressing the complexities of copyright regulation. However, they often incur prohibitive computational costs or suffer from performance trade-offs. To overcome these limitations, we introduce Copyright-Protecting Model Fusion (CP-Fuse), a novel approach that combines models trained on disjoint sets of copyrighted material during inference. In particular, CP-Fuse adaptively aggregates the model outputs to minimize the reproduction of copyrighted content, adhering to a crucial balancing property that prevents the regurgitation of memorized data. Through extensive experiments, we show that CP-Fuse significantly reduces the reproduction of…
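The adaptive aggregation described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not spelled out on this page: the two models' next-token distributions are fused as a weighted geometric mean with a single weight `alpha`, the weight is chosen by grid search at each step, and the objective trades each model's divergence from the fused distribution against how likely the already generated prefix is under that model. The names `fuse_step`, `hist1`, and `hist2` are hypothetical, and the code is not the authors' implementation.

```python
import math

def log_normalize(scores):
    """Turn unnormalized log-scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def kl_divergence(logp, logq):
    """KL(p || q) for two log-probability vectors (all entries finite)."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))

def fuse_step(logp1, logp2, hist1, hist2, grid_size=11):
    """Pick a fusion weight for one decoding step by grid search.

    logp1, logp2: next-token log-probabilities from the two models.
    hist1, hist2: log-likelihood of the prefix generated so far under
                  each model (assumed bookkeeping, kept by the caller).
    Returns (alpha, fused_log_probs), where fused is proportional to
    p1**alpha * p2**(1 - alpha).
    """
    best = None
    for k in range(grid_size):
        alpha = k / (grid_size - 1)
        fused = log_normalize(
            [alpha * a + (1.0 - alpha) * b for a, b in zip(logp1, logp2)]
        )
        # Assumed balancing objective: a model under which the prefix is
        # already very likely (large hist) contributes a smaller term, so
        # the fused distribution is pulled toward the other model.
        objective = max(
            kl_divergence(fused, logp1) - hist1,
            kl_divergence(fused, logp2) - hist2,
        )
        if best is None or objective < best[0]:
            best = (objective, alpha, fused)
    return best[1], best[2]
```

In this sketch, when the prefix is much more likely under one model (a possible sign of memorization), the search shifts the weight toward the other model. The per-step grid search also makes concrete the computational overhead that reviewers raise below.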
Peer Reviews
Decision·ICLR 2025 Oral
## **1. The approach is novel and sound**
CP-Fuse brings the more widely explored inference-time model fusion techniques to the recently emergent challenge of copyright infringement prevention; the application is novel and the problem addressed is important. The efficacy of the method is justified with both theoretical and empirical backing.
## **2. Extensive experiments on multiple use cases across scenarios**
The authors conduct thorough experiments across different datasets and scenarios (te
## **1. The strong assumption of copyright separability**
The approach relies on the assumption that copyrighted content is separable across datasets and the user of CP-Fuse has access to multiple models trained on these separate datasets. This assumption may not hold for all real-world datasets, potentially limiting the method’s applicability.
## **2. The computational complexity might limit scalability**
The computational complexity of fusion with grid search parameters may present challeng
The paper is well-written and highlights the critical issue of copyright protection in LLMs, which is increasingly relevant as these models are deployed in diverse domains. Addressing this issue is essential for responsibly advancing generative AI. CP-Fuse introduces a fresh approach to copyright protection by employing model fusion, which differentiates it from conventional methods focused solely on filtering or training-time constraints. The adaptive fusion strategy shows promise in providing
While the fusion of multiple models appears to improve copyright protection, it may introduce potential inefficiencies, especially at inference time. However, the paper lacks experimental results that measure the efficiency or computational cost of CP-Fuse compared to single-model approaches. Such evaluations are important to gauge the method's practicality in real-world applications. The paper evaluates utility on task-specific datasets but does not test CP-Fuse on general benchmarks like MMLU
This is a very well-written paper that is easy to follow. It introduces a new, adaptable method for copyright protection that can be combined with other training techniques to further increase its efficacy. A big strength is that the authors provide an extensive evaluation across multiple methods to measure memorization and they also evaluate the fluency/quality of the model outputs.
The approach assumes that copyrighted material can be effectively separated into distinct datasets, a process that becomes challenging at larger scales and in real-world scenarios. This limitation is noted in the Appendix, but the authors only address the practical challenges of creating copyrighted datasets, and not the potential ethical issues and dual-use concerns that may arise from using such data and models. Moreover, even though the authors mention how previous works are computationally hea
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Digital Rights Management and Security
