Bridging the Data Provenance Gap Across Text, Speech and Video
Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Naana Obeng-Marnu, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh

TL;DR
This study provides a comprehensive longitudinal analysis of nearly 4000 public datasets across text, speech, and video modalities. It reveals trends in data sourcing, use restrictions, and geographic and linguistic representation, and highlights gaps in diversity and transparency.
Contribution
It is the first extensive longitudinal audit across multiple modalities, offering detailed insights into dataset sourcing, licensing restrictions, and representation trends in AI training data.
Findings
Web-crawled, synthetic, and social media data have dominated training sets since 2019.
Most datasets carry non-commercial restrictions once their chain of derivations is traced, even though far fewer are restrictively licensed themselves.
Geographical and linguistic diversity in datasets has not significantly improved since 2013.
Abstract
Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations we find that while less than 33% of…
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Scientific Computing and Data Management
