AI Competitions and Benchmarks: Dataset Development
Romain Egele, Julio C. S. Jacques Junior, Jan N. van Rijn, Isabelle Guyon, Xavier Baró, Albert Clapés, Prasanna Balaprakash, Sergio Escalera, Thomas Moeslund, Jun Wan

TL;DR
This paper reviews methodologies for developing high-quality datasets for machine learning, emphasizing data collection, transformation, evaluation, and maintenance to improve model reliability and reduce deployment risks.
Contribution
It offers a comprehensive overview of dataset development processes, integrating practical experience and detailing effective management and implementation strategies.
Findings
Effective dataset development requires meticulous data collection and transformation.
Proper evaluation and maintenance are crucial for dataset quality and model performance.
The chapter provides practical guidelines for dataset distribution and ongoing management.
Abstract
Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even in today's digital era, where substantial data is generated daily, it is uncommon for it to be readily usable; most often, it necessitates meticulous manual data preparation. The haste in developing new models can frequently result in various shortcomings, potentially posing risks when deployed in real-world scenarios (e.g., social discrimination, critical failures), leading to the failure or substantial escalation of costs in AI-based projects. This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience, in the development of datasets for machine learning. Initially, we develop the tasks…
Taxonomy
Topics: Big Data and Business Intelligence
