Copyright in AI Pre-Training Data Filtering: Regulatory Landscape and Mitigation Strategies

Mariia Kyrychenko; Mykyta Mudryi; Markiyan Chaklosh

arXiv:2512.02047 · cs.CY · January 21, 2026

Open Access

TL;DR

This paper analyzes the regulatory challenges of copyright in AI training data, highlighting gaps in enforcement and proposing a multilayered filtering pipeline to prevent copyright violations proactively.

Contribution

It introduces a comprehensive multilayered filtering approach combining access control, content verification, and machine learning to enhance copyright protection during AI training data collection.

Findings

01. Identified critical gaps in current data filtering methods.
02. Existing solutions only address specific aspects of copyright enforcement.
03. Proposed a multilayered filtering pipeline for proactive copyright protection.

Abstract

The rapid advancement of general-purpose AI models has increased concerns about copyright infringement in training data, yet current regulatory frameworks remain predominantly reactive rather than proactive. This paper examines the regulatory landscape of AI training data governance in major jurisdictions, including the EU, the United States, and the Asia-Pacific region. It also identifies critical gaps in enforcement mechanisms that threaten both creator rights and the sustainability of AI development. Through analysis of major cases, we identify critical gaps in pre-training data filtering. Existing solutions such as transparency tools, perceptual hashing, and access control mechanisms address only specific aspects of the problem and cannot prevent initial copyright violations. We identify two fundamental challenges: pre-training license collection and content filtering, which faces…

Taxonomy

Topics: Law, AI, and Intellectual Property · Ethics and Social Impacts of AI · Copyright and Intellectual Property