Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection
Saachi Jain, Kimia Hamidieh, Kristian Georgiev, Andrew Ilyas, Marzyeh Ghassemi, Aleksander Madry

TL;DR
This paper introduces D3M, a data debiasing method that isolates and removes the specific training examples driving a model's failures on minority subgroups, improving subgroup robustness without training group annotations or additional hyperparameter tuning.
Contribution
D3M is a novel data selection technique that isolates and removes problematic examples to enhance subgroup robustness without requiring group annotations.
Findings
Improves performance on underperforming subgroups
Removes only a small number of training examples
Requires no training group annotations or additional hyperparameter tuning
Abstract
Machine learning models can fail on subgroups that are underrepresented during training. While techniques such as dataset balancing can improve performance on underperforming groups, they require access to training group annotations and can end up removing large portions of the dataset. In this paper, we introduce Data Debiasing with Datamodels (D3M), a debiasing approach which isolates and removes specific training examples that drive the model's failures on minority groups. Our approach enables us to efficiently train debiased classifiers while removing only a small number of examples, and does not require training group annotations or additional hyperparameter tuning.
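The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a precomputed attribution matrix (for example, one estimated with datamodels) whose entry [j, i] approximates how much training example i helps the model on held-out example j, and it assumes group labels are available for the held-out examples only. The function name `select_training_examples`, its arguments, and the removal budget `k` are all hypothetical; the abstract does not say how the number of removed examples is chosen.

```python
import numpy as np


def select_training_examples(attribution, val_groups, val_correct, k):
    """Illustrative selection step (hypothetical helper, not the paper's code).

    attribution: (n_val, n_train) array; entry [j, i] is the assumed effect of
        training example i on held-out example j (positive means it helps).
    val_groups:  (n_val,) group label of each held-out example (assumed known).
    val_correct: (n_val,) whether the current model classifies example j correctly.
    k:           number of training examples to drop (illustrative budget).
    """
    attribution = np.asarray(attribution, dtype=float)
    val_groups = np.asarray(val_groups)
    val_correct = np.asarray(val_correct, dtype=bool)

    # Find the held-out group with the lowest accuracy.
    groups = np.unique(val_groups)
    accuracy = np.array([val_correct[val_groups == g].mean() for g in groups])
    worst_group = groups[np.argmin(accuracy)]

    # Average each training example's effect over the worst group's examples.
    effect_on_worst = attribution[val_groups == worst_group].mean(axis=0)

    # Drop the k training examples with the most negative average effect.
    order = np.argsort(effect_on_worst, kind="stable")
    removed = np.sort(order[:k])
    keep = np.setdiff1d(np.arange(attribution.shape[1]), removed)
    return keep, removed


# Toy example: training example 3 hurts the held-out examples of group 1.
attribution = np.array([
    [0.2, 0.1, 0.0, 0.3, 0.1],
    [0.1, 0.2, 0.1, 0.2, 0.0],
    [0.1, 0.0, 0.2, -0.9, 0.1],
    [0.0, 0.1, 0.1, -0.7, -0.1],
])
val_groups = np.array([0, 0, 1, 1])
val_correct = np.array([True, True, False, True])

keep, removed = select_training_examples(attribution, val_groups, val_correct, k=1)
```

A classifier would then be retrained on the training examples indexed by `keep`. Note that no training group annotations are used anywhere in the sketch; only the held-out examples carry group labels.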
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Clustering Algorithms Research · Text and Document Classification Technologies
