Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations
Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi

TL;DR
This paper shows that aggregate metrics can hide significant out-of-distribution (OOD) generalization failures caused by spurious correlations, and introduces a method, OODSelect, for identifying the OOD subsets where those failures occur.
Contribution
The authors introduce OODSelect, a gradient-based method that uncovers semantically coherent OOD subsets where accuracy-on-the-line does not hold, challenging the assumption that spurious correlations are rare in practice.
Findings
Aggregate metrics can conceal OOD failures.
In some OOD subsets, higher ID accuracy predicts lower OOD accuracy.
These subsets sometimes cover over half of the standard OOD set.
Abstract
Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations (correlations that improve ID but reduce OOD performance) are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified…
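The aggregation effect described in the abstract can be illustrated with a small synthetic sketch. The code below is not OODSelect and uses no data from the paper: the model accuracies, the group sizes, the per-model hit counts, and the helper name `accuracy_correlation` are all assumptions made up for illustration. It only shows how a positive ID/OOD correlation on a full OOD set can coexist with a negative correlation on a subset of it. Here the subset is known by construction, whereas the paper's method is described as finding such subsets.

```python
import numpy as np


def accuracy_correlation(id_acc, correct, mask):
    """Pearson r between ID accuracy and OOD accuracy on the masked examples.

    id_acc:  shape (n_models,), ID accuracy of each model
    correct: shape (n_models, n_examples), True where a model is right
    mask:    shape (n_examples,), True for examples in the subset
    """
    ood_acc = correct[:, mask].mean(axis=1)
    return float(np.corrcoef(id_acc, ood_acc)[0, 1])


# Hypothetical ID accuracies for seven models (illustrative values only).
id_acc = np.array([0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90])

# Hypothetical OOD set: group A (70 examples) improves with ID accuracy,
# group B (30 examples) gets worse, as it would under a spurious correlation.
n_a, n_b = 70, 30
hits_a = [21, 24, 27, 29, 32, 35, 38]  # correct predictions per model in A
hits_b = [21, 20, 20, 19, 18, 17, 16]  # correct predictions per model in B

# Build the per-example correctness matrix from the hit counts.
correct = np.zeros((len(id_acc), n_a + n_b), dtype=bool)
for m, (k_a, k_b) in enumerate(zip(hits_a, hits_b)):
    correct[m, :k_a] = True
    correct[m, n_a:n_a + k_b] = True

all_examples = np.ones(n_a + n_b, dtype=bool)
subset_b = np.zeros(n_a + n_b, dtype=bool)
subset_b[n_a:] = True

r_all = accuracy_correlation(id_acc, correct, all_examples)
r_subset = accuracy_correlation(id_acc, correct, subset_b)

print(f"full OOD set: r = {r_all:+.2f}")
print(f"subset B:     r = {r_subset:+.2f}")
```

In this toy setup the full-set correlation is strongly positive while the correlation on group B is strongly negative, because group A is larger and dominates the aggregate accuracy.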
