Data Governance in the Age of Large-Scale Data-Driven Language Technology
Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Gérard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Isaac Johnson, Dragomir Radev, Somaieh Nikpoor, Jörg Frohberg

TL;DR
This paper proposes a comprehensive international framework for managing language data in the era of large-scale language models, emphasizing transparency, stakeholder collaboration, and value-based governance.
Contribution
It introduces a novel multi-party governance structure for language data, integrating technical and organizational tools grounded in global collaboration.
Findings
Developed a multi-party international governance framework
Incorporated human values into data management strategies
Supported stakeholders with organizational and technical tools
Abstract
The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure that focuses on language data and incorporates the technical and organizational tools needed to support its work.
