Dataset Representativeness and Downstream Task Fairness
Victor Borza, Andrew Estornell, Chien-Ju Ho, Bradley Malin, Yevgeniy Vorobeychik

TL;DR
This paper explores how dataset representativeness impacts the fairness of classifiers, revealing a complex trade-off where improving one can harm the other, and discusses strategies to balance them.
Contribution
It provides empirical and theoretical insights into the tension between dataset representativeness and classifier fairness, highlighting challenges in dataset sampling strategies.
Findings
Better representativeness can increase classifier unfairness.
Over-sampling underrepresented groups may lead to greater bias.
Fairness-aware sampling often over-samples majority groups.
Abstract
Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are biased and not representative, i.e., the collected dataset does not accurately reflect the distribution of demographics of the true population. This is a concern because subgroups within the population can be under- or over-represented in a dataset, which may harm generalizability and lead to an unequal distribution of benefits and harms from downstream tasks that use such datasets (e.g., algorithmic bias in medical decision-making algorithms). In this paper, we assess the relationship between…
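One way to make the abstract's notion of representativeness concrete is to compare the demographic shares in a collected sample with the shares in the population of interest. The sketch below is illustrative only: it uses total variation distance as the gap measure, which is an assumption made here and not necessarily the metric used in the paper. The function names, group labels, and population shares are all hypothetical.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of a non-empty list of group labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def representativeness_gap(sample_labels, population_shares):
    """Total variation distance between sample and population group shares.

    Assumption: this distance is one possible gap measure, chosen for
    illustration. 0.0 means the sample matches the population shares;
    larger values mean the sample is less representative.
    """
    sample = group_shares(sample_labels)
    groups = set(sample) | set(population_shares)
    return 0.5 * sum(
        abs(sample.get(g, 0.0) - population_shares.get(g, 0.0)) for g in groups
    )


# Hypothetical population: 60% group A, 30% group B, 10% group C.
population = {"A": 0.6, "B": 0.3, "C": 0.1}

# Hypothetical opportunistic sample that over-represents A and misses C.
sample = ["A"] * 8 + ["B"] * 2

gap = representativeness_gap(sample, population)
print(f"representativeness gap: {gap:.2f}")
```

A gap of zero says only that the group proportions match. It says nothing about how a classifier trained on the sample performs for each group, which is the relationship the paper examines.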
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
Methods: Sparse Evolutionary Training
