AI and the Dynamic Supply of Training Data
Christian Peukert, Florian Abeillon, Jérémie Haese, Franziska Kaiser, Alexander Staub

TL;DR
This paper investigates how contributors to Unsplash react when their works are used as training data for AI, revealing behavioral changes that impact data diversity and quality, and discusses policy solutions to address these issues.
Contribution
It provides empirical evidence on contributor reactions to AI training data use, highlighting behavioral shifts and proposing incentive-aligned policy interventions.
Findings
Higher dropout rates among affected contributors
Reduced upload rates for professional and heavily affected users
Changes in contribution diversity and novelty
Abstract
Artificial intelligence (AI) systems rely heavily on human-generated data, yet the people behind that data are often overlooked. Human behavior can play a major role in AI training datasets, be it in limiting access to existing works or in deciding which types of new works to create or whether to create any at all. We examine creators' behavioral change when their works become training data for commercial AI. Specifically, we focus on contributors on Unsplash, a popular stock image platform with about 6 million high-quality photos and illustrations. In the summer of 2020, Unsplash launched a research program and released a dataset of 25,000 images for commercial AI use. We study contributors' reactions, comparing contributors whose works were included in this dataset to contributors whose works were not. Our results suggest that treated contributors left the platform at a…
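The treated-versus-untreated comparison described in the abstract can be illustrated with a minimal sketch. It assumes a simple difference in dropout rates between the two groups; the excerpt does not state the paper's actual estimator. The `Contributor` record, its fields, the helper functions, and the toy data are all hypothetical and are not taken from the Unsplash dataset.

```python
from dataclasses import dataclass


@dataclass
class Contributor:
    # Hypothetical record, not the paper's data schema.
    treated: bool        # True if any of the contributor's works was in the released dataset
    active_after: bool   # True if the contributor kept uploading after the release


def dropout_rate(contributors, treated):
    """Share of contributors in the given group who stopped uploading."""
    group = [c for c in contributors if c.treated == treated]
    if not group:
        raise ValueError("no contributors in the requested group")
    return sum(not c.active_after for c in group) / len(group)


def dropout_gap(contributors):
    """Treated dropout rate minus untreated dropout rate (a raw comparison of means)."""
    return dropout_rate(contributors, True) - dropout_rate(contributors, False)


# Made-up toy data for illustration only.
toy = [
    Contributor(treated=True, active_after=False),
    Contributor(treated=True, active_after=False),
    Contributor(treated=True, active_after=True),
    Contributor(treated=True, active_after=True),
    Contributor(treated=False, active_after=False),
    Contributor(treated=False, active_after=True),
    Contributor(treated=False, active_after=True),
    Contributor(treated=False, active_after=True),
]

gap = dropout_gap(toy)
print(f"treated: {dropout_rate(toy, True):.2f}, "
      f"untreated: {dropout_rate(toy, False):.2f}, gap: {gap:.2f}")
```

A positive gap means the treated group left at a higher rate in the toy data. A raw gap like this does not account for pre-existing differences between the groups, so it should be read as a description of the comparison, not as a causal estimate.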
Taxonomy
Topics: Big Data and Business Intelligence
Methods: Focus
