The Economics of AI Training Data: A Research Agenda

Hamidah Oderinwale; Anna Kazlauskas

arXiv:2510.24990·cs.CY·April 28, 2026

TL;DR

This paper establishes data economics as a field by analyzing data properties, documenting AI data deals, and proposing a hierarchy of data units, highlighting key challenges and open research questions.

Contribution

It characterizes data's unique properties, documents market fragmentation and pricing mechanisms, and introduces a formal hierarchy of data units for AI economics.

Findings

01 AI training data deals documented from 2020 to 2025 show persistent market fragmentation.

02 Five distinct pricing mechanisms are identified, ranging from per-unit licensing to commissioning.

03 Most deals exclude original creators from compensation.

Abstract

Despite data's central role in AI production, it remains the least understood input. As AI labs exhaust public data and turn to proprietary sources, with deals reaching hundreds of millions of dollars, research across computer science, economics, law, and policy has fragmented. We establish data economics as a coherent field through three contributions. First, we characterize data's distinctive properties (nonrivalry, context dependence, and emergent rivalry through contamination) and trace historical precedents for market formation in commodities such as oil and grain. Second, we present systematic documentation of AI training data deals from 2020 to 2025, revealing persistent market fragmentation, five distinct pricing mechanisms (from per-unit licensing to commissioning), and the exclusion of original creators from compensation in most deals. Third, we propose a formal hierarchy of…
