Bridging the Data Provenance Gap Across Text, Speech and Video

Shayne Longpre; Nikhil Singh; Manuel Cherep; Kushagra Tiwary; Joanna Materzynska; William Brannon; Robert Mahari; Naana Obeng-Marnu; Manan Dey; Mohammed Hamdy; Nayan Saxena; Ahmad Mustafa Anis; Emad A. Alghamdi; Vu Minh Chien; Da Yin; Kun Qian; Yizhi Li; Minnie Liang; An Dinh; Shrestha Mohanty; Deividas Mataciunas; Tobin South; Jianguo Zhang; Ariel N. Lee; Campbell S. Lund; Christopher Klamm; Damien Sileo; Diganta Misra; Enrico Shippole; Kevin Klyman; Lester JV Miranda; Niklas Muennighoff; Seonghyeon Ye; Seungone Kim; Vipul Gupta; Vivek Sharma; Xuhui Zhou; Caiming Xiong; Luis Villa; Stella Biderman; Alex Pentland; Sara Hooker; Jad Kabbara

arXiv:2412.17847 · cs.AI · February 20, 2025 · 3 cites

Open Access

TL;DR

This study provides a longitudinal analysis of nearly 4000 public datasets across text, speech, and video. It traces trends in data sourcing, use restrictions, and geographic and linguistic representation, and highlights gaps in diversity and transparency.

Contribution

It is the first extensive longitudinal audit across multiple modalities, offering detailed insights into dataset sourcing, licensing restrictions, and representation trends in AI training data.

Findings

01

Web-crawled, synthetic, and social media data have dominated training sets since 2019.

02

Most datasets carry non-commercial restrictions, even though only a minority are themselves restrictively licensed.

03

Geographical and linguistic diversity in datasets has not significantly improved since 2013.

Abstract

Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities--popular text, speech, and video datasets--from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations we find that while less than 33% of…

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Scientific Computing and Data Management