How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI
Sophia N. Wilson, Sebastian Mair, Mophat Okinyi, Erik B. Dam, Janin Koch, Raghavendra Selvan

TL;DR
This paper investigates the environmental, social, and economic costs of large-scale data in frontier AI, highlighting how hyper-datafication redistributes burdens and proposing recommendations to mitigate these impacts.
Contribution
The paper introduces the concept of hyper-datafication, analyzes approximately 550,000 datasets from the Hugging Face Hub alongside qualitative responses, and offers the Data PROOFS guidelines to address data-related sustainability costs in AI.
Findings
Hyper-datafication increases resource consumption and environmental burdens.
Data-related costs are disproportionately shifted to the Global South and under-represented groups.
Proposed Data PROOFS recommendations aim to mitigate sustainability costs.
Abstract
Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this…
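To make the idea of storage-related energy consumption and carbon footprint concrete, here is a minimal Python sketch of how such an estimate could be computed from a dataset's size. It is not the authors' method. The class `StorageAssumptions`, both function names, and every coefficient are hypothetical placeholders chosen for illustration. None of them come from the paper.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass(frozen=True)
class StorageAssumptions:
    """Hypothetical coefficients; none of these values come from the paper."""
    watts_per_terabyte: float    # assumed average power draw per stored TB
    replication_factor: float    # assumed number of copies kept of each dataset
    pue: float                   # assumed data-centre power usage effectiveness
    grid_kg_co2e_per_kwh: float  # assumed carbon intensity of the electricity grid


def annual_storage_energy_kwh(size_tb: float, a: StorageAssumptions) -> float:
    """Energy (kWh) to keep `size_tb` terabytes online for one year."""
    watts = size_tb * a.replication_factor * a.watts_per_terabyte * a.pue
    return watts * HOURS_PER_YEAR / 1000.0


def annual_storage_carbon_kg(size_tb: float, a: StorageAssumptions) -> float:
    """Carbon footprint (kg CO2e) of one year of storage under the assumptions."""
    return annual_storage_energy_kwh(size_tb, a) * a.grid_kg_co2e_per_kwh


# Illustrative placeholder values only, picked to keep the arithmetic easy to follow.
example = StorageAssumptions(
    watts_per_terabyte=1.0,
    replication_factor=2.0,
    pue=1.5,
    grid_kg_co2e_per_kwh=0.5,
)
energy_kwh = annual_storage_energy_kwh(10.0, example)
carbon_kg = annual_storage_carbon_kg(10.0, example)
```

The estimate is linear in dataset size, so summing it over a collection of datasets only requires their sizes. Real figures depend heavily on the storage medium, replication policy, and grid mix, which is why each of those is a separate, explicit parameter here.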
