Adaptive Sampling Strategies to Construct Equitable Training Datasets

William Cai; Ro Encarnacion; Bobbie Chern; Sam Corbett-Davies; Miranda Bogen; Stevie Bergman; Sharad Goel

arXiv:2202.01327·cs.LG·February 4, 2022·1 cites

Open Access

TL;DR

This paper introduces a statistical framework for creating equitable training datasets by optimally allocating data collection resources across groups, improving model fairness in diverse applications.

Contribution

It formalizes the problem of equitable dataset construction as a constrained optimization task and proposes an adaptive sampling strategy that outperforms common heuristics.
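
One way to write down a constrained optimization of this kind is sketched below. The notation is illustrative and not taken from the paper: $K$ groups, $n_g$ samples collected from group $g$, a per-sample cost $c_g$, a total budget $B$, a group-specific performance metric $m_g(n_g)$, and an aggregate objective $F$ are all assumed symbols.

```latex
\begin{aligned}
\max_{n_1, \dots, n_K} \quad & F\bigl(m_1(n_1), \dots, m_K(n_K)\bigr) \\
\text{subject to} \quad & \sum_{g=1}^{K} c_g \, n_g \le B, \\
& n_g \ge 0 \quad \text{for } g = 1, \dots, K .
\end{aligned}
```

Under this reading, equal sampling and proportional sampling are two fixed choices of $(n_1, \dots, n_K)$ that satisfy the budget constraint, while an adaptive strategy would revise the allocation as its estimates of each $m_g$ improve.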

Findings

01 Adaptive sampling outperforms equal and proportional sampling.

02 The framework efficiently solves the data allocation problem without prior knowledge of learning rates.

03 Application to genomic data demonstrates improved fairness in polygenic risk scores.

Abstract

In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific…
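
The budget-allocation setting described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the power-law error curve, its parameters, the unit cost per sample, the objective (sum of group errors), and the names `power_law_error` and `allocate_budget`. The sketch also treats the curves as known in advance, whereas the abstract refers to estimated quantities, so this is a simplified static allocation rather than the authors' adaptive procedure.

```python
import heapq


def power_law_error(n, a, b, c):
    """Hypothetical learning curve: model error for a group after n samples."""
    return a * (n + 1) ** (-b) + c


def allocate_budget(curves, budget):
    """Greedily assign `budget` unit-cost samples across groups.

    curves: dict mapping group name -> (a, b, c) power-law parameters (assumed).
    At each step, one sample goes to the group whose error would drop the most.
    """
    counts = {g: 0 for g in curves}
    # Max-heap (via negated gains) holding the marginal gain of each group's next sample.
    heap = []
    for g, params in curves.items():
        gain = power_law_error(0, *params) - power_law_error(1, *params)
        heap.append((-gain, g))
    heapq.heapify(heap)

    for _ in range(budget):
        _, g = heapq.heappop(heap)
        counts[g] += 1
        n = counts[g]
        gain = power_law_error(n, *curves[g]) - power_law_error(n + 1, *curves[g])
        heapq.heappush(heap, (-gain, g))
    return counts


# Illustrative parameters: group_b starts with higher error than group_a.
curves = {"group_a": (1.0, 0.5, 0.05), "group_b": (2.0, 0.5, 0.05)}
allocation = allocate_budget(curves, budget=1000)
```

Because the assumed curves have diminishing returns, giving each sample to the group with the largest marginal improvement minimizes the summed error for this toy objective. A different objective (for example, minimizing the worst group's error) would call for a different selection rule.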


Taxonomy

Topics: Machine Learning and Algorithms · Machine Learning and Data Classification · Privacy-Preserving Technologies in Data