Towards Best Practices for Open Datasets for LLM Training

Stefan Baack; Stella Biderman; Kasia Odrozek; Aviya Skowron; Ayah Bdeir; Jillian Bommarito; Jennifer Ding; Maximilian Gahntz; Paul Keller; Pierre-Carl Langlais; Greg Lindahl; Sebastian Majstorovic; Nik Marda; Guilherme Penedo; Maarten Van Segbroeck; Jennifer Wang; Leandro von Werra; Mitchell Baker; Julie Belião; Kasia Chmielinski; Marzieh Fadaee; Lisa Gutermuth; Hynek Kydlíček; Greg Leppert; EM Lewis-Jong; Solana Larsen; Shayne Longpre; Angela Oduor Lungati; Cullen Miller; Victor Miller; Max Ryabinin; Kathleen Siminyu; Andrew Strait; Mark Surman; Anna Tumadóttir; Maurice Weber; Rebecca Weiss; Lee White; Thomas Wolf

arXiv:2501.08365 · cs.CY · January 16, 2025 · 2 cites


Open Access

TL;DR

This paper discusses the legal, technical, and sociological challenges of creating open datasets for training large language models, emphasizing the need for collaboration and standards to promote transparency and responsible AI development.

Contribution

It highlights the barriers to assembling large open datasets and proposes a multidisciplinary approach to foster open, responsibly curated data for LLM training.

Findings

01 Legal and technical challenges hinder open dataset creation

02 Metadata and digitization issues complicate dataset assembly

03 Collaborative efforts are essential for responsible data governance

Abstract

Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in jurisdictions like the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend, among both corporate and public interest actors, towards minimizing the information shared about training datasets. This trend causes harm by hindering transparency, accountability, and innovation in the broader ecosystem: it denies researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this…


Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training