DataPerf: Benchmarks for Data-Centric AI Development
Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Alicia Parrish, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Douwe Kiela, David Jurado, David Kanter, Rafael Mosquera, Juan Ciro, Lora Aroyo

TL;DR
DataPerf introduces a comprehensive benchmark suite for evaluating and advancing data-centric AI, promoting innovation, reproducibility, and community collaboration across diverse data tasks and modalities.
Contribution
It presents the first community-led, open-source benchmark platform specifically designed for data-centric AI, enabling iterative dataset development and comparison.
Findings
Contains five diverse benchmarks covering vision, speech, and more.
Supports multiple rounds of community challenges and contributions.
Open-source platform with baseline implementations for reproducibility.
Abstract
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks…
Taxonomy
Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
