Towards Best Practices for Open Datasets for LLM Training
Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo, Maarten Van Segbroeck, Jennifer Wang

TL;DR
This paper discusses the legal, technical, and sociological challenges of creating open datasets for training large language models, emphasizing the need for collaboration and standards to promote transparency and responsible AI development.
Contribution
It highlights the barriers to assembling large open datasets and proposes a multidisciplinary approach to foster open, responsibly curated data for LLM training.
Findings
Legal and technical challenges hinder open dataset creation
Metadata and digitization issues complicate dataset assembly
Collaborative efforts are essential for responsible data governance
Abstract
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in jurisdictions such as the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend of limiting information about training data harms the broader ecosystem: it hinders transparency, accountability, and innovation by denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
