TL;DR
This paper examines how common but unreflective data practices in fair ML undermine the reach and reliability of algorithmic fairness findings. It highlights the underrepresentation of certain protected attributes, the exclusion of minorities during preprocessing, and opaque data processing, and it proposes recommendations for responsible data use.
Contribution
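It provides a systematic analysis of the protected information encoded in tabular fair ML datasets and of how those datasets are used in 280 experiments across 142 publications, identifying key shortcomings and offering guidelines for more transparent and inclusive data practices.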
Findings
Certain protected attributes are underrepresented in both datasets and evaluations.
Minorities are widely excluded during data preprocessing.
Opaque data processing threatens the generalization of fairness research findings.
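The second finding can be made concrete with a small audit of how many records of each protected group survive a preprocessing step. The sketch below is illustrative only and is not the paper's analysis code: the helper `group_retention`, the attribute name `race`, the group labels, and the toy records are all hypothetical, and the "keep the two largest groups" filter is an assumed example of a binarizing step, not one taken from any specific publication.

```python
from collections import Counter


def group_retention(rows_before, rows_after, attribute):
    """Return {group: fraction of rows retained} for one protected attribute.

    Hypothetical helper for illustration; not from the paper.
    """
    before = Counter(row[attribute] for row in rows_before)
    after = Counter(row[attribute] for row in rows_after)
    return {group: after.get(group, 0) / count for group, count in before.items()}


# Toy, made-up records (not drawn from any real dataset).
raw = [
    {"race": "A", "label": 1},
    {"race": "A", "label": 0},
    {"race": "B", "label": 1},
    {"race": "B", "label": 0},
    {"race": "C", "label": 1},
    {"race": "D", "label": 0},
]

# Assumed preprocessing step: keep only the two largest groups so the
# protected attribute becomes binary.
largest_two = {g for g, _ in Counter(r["race"] for r in raw).most_common(2)}
filtered = [r for r in raw if r["race"] in largest_two]

# Per-group retention makes the exclusion visible: groups C and D vanish.
retention = group_retention(raw, filtered, "race")
for group, share in sorted(retention.items()):
    print(f"{group}: {share:.0%} of rows retained")
```

Reporting such per-group retention alongside an experiment is one simple way to document a filtering step that would otherwise go unnoticed.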
Abstract
Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications. Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization…
Taxonomy
Methods: Sparse Evolutionary Training
