Algorithmic Fairness Datasets: the Story so Far
Alessandro Fabris, Stefano Messina, Gianmaria Silvello, Gian Antonio Susto

TL;DR
This paper surveys over two hundred datasets used in algorithmic fairness research, providing standardized documentation and analysis to address data documentation gaps and improve dataset understanding for fair machine learning.
Contribution
It offers a comprehensive, standardized documentation of key fairness datasets and analyzes their properties, limitations, and ethical considerations, unifying prior scholarship.
Findings
Identified the three most popular fairness datasets: Adult, COMPAS, and German Credit.
Provided detailed documentation and analysis of over two hundred datasets and their properties.
Highlighted best practices for dataset curation in fairness research.
Abstract
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being. As a result, a growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations. Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented. Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity). In this work, we target data documentation debt by surveying over two hundred datasets employed in algorithmic fairness research, and producing standardized and searchable…
