Adaptive Sampling Strategies to Construct Equitable Training Datasets
William Cai, Ro Encarnacion, Bobbie Chern, Sam Corbett-Davies, Miranda Bogen, Stevie Bergman, Sharad Goel

TL;DR
This paper introduces a statistical framework for creating equitable training datasets by optimally allocating data collection resources across groups, improving model fairness in diverse applications.
Contribution
It formalizes the problem of equitable dataset construction as a constrained optimization task and proposes an adaptive sampling strategy that outperforms common heuristics.
Findings
Adaptive sampling outperforms equal and proportional sampling.
The framework efficiently solves data allocation without prior knowledge of learning rates.
Application to genomic data demonstrates improved fairness in polygenic risk scores.
Abstract
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific…
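The budget-allocation framing in the abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the power-law form of each group's error curve, the `Group` fields, the weighted-sum objective, and the greedy batch rule in the hypothetical `allocate` function. The curve parameters are also treated as known, whereas an adaptive procedure would presumably re-estimate them as data arrives.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    a: float             # scale of the assumed power-law error curve
    b: float             # decay exponent of the assumed curve
    floor: float         # assumed irreducible error
    cost: float          # cost of collecting one sample from this group
    weight: float = 1.0  # importance of this group in the objective


def predicted_error(g: Group, n: int) -> float:
    """Assumed learning curve: error = a * n^(-b) + floor."""
    return g.a * max(n, 1) ** (-g.b) + g.floor


def allocate(groups, budget, batch=10, start=10):
    """Greedily spend `budget`, one batch at a time, on the group whose
    weighted predicted error drops the most per unit cost."""
    n = {g.name: start for g in groups}
    remaining = budget - sum(g.cost * start for g in groups)
    if remaining < 0:
        raise ValueError("budget too small for the initial samples")

    def gain_per_cost(g):
        drop = predicted_error(g, n[g.name]) - predicted_error(g, n[g.name] + batch)
        return g.weight * drop / (g.cost * batch)

    while True:
        affordable = [g for g in groups if g.cost * batch <= remaining]
        if not affordable:
            break
        best = max(affordable, key=gain_per_cost)
        n[best.name] += batch
        remaining -= best.cost * batch
    return n


if __name__ == "__main__":
    groups = [
        Group("A", a=1.0, b=0.5, floor=0.05, cost=1.0),
        Group("B", a=1.0, b=0.2, floor=0.05, cost=1.0),
    ]
    print(allocate(groups, budget=1000))
```

Under these assumed curves, the group whose error decays more slowly ends up with the larger share of the budget, which is the kind of outcome that neither equal nor proportional sampling would produce by construction.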
Taxonomy
Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Privacy-Preserving Technologies in Data
