Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection

Saachi Jain; Kimia Hamidieh; Kristian Georgiev; Andrew Ilyas; Marzyeh Ghassemi; Aleksander Madry

arXiv:2406.16846 · cs.LG · June 25, 2024 · 1 citation

TL;DR

This paper introduces D3M, a data debiasing method that identifies and removes the specific training examples driving a model's failures on minority subgroups, improving subgroup robustness without requiring training group annotations or additional hyperparameter tuning.

Contribution

D3M is a novel data selection technique that isolates and removes problematic examples to enhance subgroup robustness without requiring group annotations.

Findings

01  Improves performance on underperforming subgroups

02  Removes only a small number of training examples

03  Requires neither training group annotations nor additional hyperparameter tuning

Abstract

Machine learning models can fail on subgroups that are underrepresented during training. While techniques such as dataset balancing can improve performance on underperforming groups, they require access to training group annotations and can end up removing large portions of the dataset. In this paper, we introduce Data Debiasing with Datamodels (D3M), a debiasing approach which isolates and removes specific training examples that drive the model's failures on minority groups. Our approach enables us to efficiently train debiased classifiers while removing only a small number of examples, and does not require training group annotations or additional hyperparameter tuning.
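
The idea of removing the training examples that drive failures on a minority group can be illustrated with a minimal sketch. It is not the paper's exact procedure. It assumes a per-example attribution matrix is already available (for instance from a datamodel-style estimator) and that a small held-out set has group labels; no group labels are used for the training examples. The function name `select_examples_to_remove`, the sign convention of `attributions`, and the toy numbers are all hypothetical.

```python
import numpy as np

def select_examples_to_remove(attributions, val_groups, val_correct, k):
    """Pick up to k training examples estimated to hurt the worst held-out group.

    attributions: (n_train, n_val) array. Entry [i, j] is an ASSUMED estimate of
        how much training example i raises the model's margin on held-out
        example j (negative means it is estimated to hurt).
    val_groups:   (n_val,) integer group ids for the held-out examples.
    val_correct:  (n_val,) booleans, whether the current model gets each right.
    """
    groups = np.unique(val_groups)
    # Accuracy of the current model on each held-out group.
    acc = np.array([val_correct[val_groups == g].mean() for g in groups])
    worst = groups[np.argmin(acc)]

    # Average estimated effect of every training example on the worst group.
    effect = attributions[:, val_groups == worst].mean(axis=1)

    # Most negative effects first; keep only examples estimated to be harmful.
    candidates = np.argsort(effect)[:k]
    return worst, candidates[effect[candidates] < 0]

# Toy data: 4 training examples, 4 held-out examples in two groups.
attributions = np.array([
    [0.1, 0.1, -0.5, -0.3],
    [0.0, 0.0,  0.2,  0.2],
    [0.0, 0.0, -0.1,  0.0],
    [0.3, 0.3,  0.1,  0.1],
])
val_groups = np.array([0, 0, 1, 1])
val_correct = np.array([True, True, False, True])

worst, removed = select_examples_to_remove(attributions, val_groups, val_correct, k=2)
print("worst group:", worst, "training indices to drop:", removed.tolist())
```

After dropping the selected indices, the model would be retrained on the remaining data. Capping the selection at `k` and discarding non-harmful candidates keeps the number of removed examples small, in line with the abstract's contrast with dataset balancing.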

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Clustering Algorithms Research · Text and Document Classification Technologies