Striving for data-model efficiency: Identifying data externalities on group performance
Esther Rolf, Ben Packer, Alex Beutel, Fernando Diaz

TL;DR
This paper investigates how adding training data can sometimes harm subgroup performance in machine learning, highlighting the importance of understanding data-model interactions for building trustworthy AI.
Contribution
It introduces the concept of negative data externalities on group performance and analyzes how data and model size influence these externalities.
Findings
Adding training data from some sources can reduce performance on key subgroups (a negative data externality).
Externalities vary with training set size and model complexity.
Understanding externalities is crucial for improving data efficiency.
Abstract
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance. In this work, we seek to better understand how we might characterize, detect, and design for data-model synergies. We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population, a phenomenon we refer to as negative data externalities on group performance. Such externalities can arise in standard learning settings and can manifest differently depending on conditions between training set size and model size. Data externalities directly imply a lower bound on feasible model improvements, yet improving models efficiently requires understanding the underlying data-model tensions.…
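A toy illustration may help make the definition concrete. The sketch below is not the paper's experimental setup; it assumes a deliberately misspecified model (a single no-intercept slope shared by all data) and two hypothetical sources whose input-output relationships conflict. All names (`fit_slope`, `mse`, `externality`, `group_a`, `source_b`) are invented for illustration. It measures the change in error on one group when data from another source is added to the training set.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope w for the no-intercept model y ~ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x against y."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def externality(base, extra, group_eval):
    """Change in group error after adding `extra` to the training set.

    Each argument is an (xs, ys) pair of lists. A positive return value
    means the group's error rose, i.e. a negative data externality.
    """
    w_without = fit_slope(*base)
    w_with = fit_slope(base[0] + extra[0], base[1] + extra[1])
    return mse(w_with, *group_eval) - mse(w_without, *group_eval)

random.seed(0)
xs_a = [random.gauss(0, 1) for _ in range(200)]
xs_b = [random.gauss(0, 1) for _ in range(200)]
xs_eval = [random.gauss(0, 1) for _ in range(200)]

# Assumed toy data: group A follows y = x, the added source B follows y = -x.
group_a = (xs_a, list(xs_a))
source_b = (xs_b, [-x for x in xs_b])
eval_a = (xs_eval, list(xs_eval))   # held-out points from group A

delta = externality(group_a, source_b, eval_a)
print(f"change in group A error after adding source B: {delta:.3f}")
```

Here the shared slope cannot fit both sources at once, so the added data pulls the fit away from group A and its held-out error goes up. In a real study the same comparison would be repeated across training set sizes and model sizes, since the abstract notes that externalities can manifest differently under those conditions.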
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Forecasting Techniques and Applications
