Aggregation Hides Out-of-Distribution Generalization Failures from Spurious Correlations

Olawale Salaudeen; Haoran Zhang; Kumail Alhamoud; Sara Beery; Marzyeh Ghassemi

arXiv:2510.24884·cs.LG·October 30, 2025

TL;DR

Aggregate metrics can hide significant out-of-distribution (OOD) generalization failures caused by spurious correlations. The paper introduces a method, OODSelect, that identifies the OOD subsets where these failures occur.

Contribution

The authors introduce OODSelect, a gradient-based method that uncovers semantically coherent OOD subsets where accuracy-on-the-line does not hold, challenging the assumption that spurious correlations are rare in practice.

Findings

01 Aggregate metrics can conceal OOD failures.

02 On some OOD subsets, higher ID accuracy predicts lower OOD accuracy.

03 In some benchmarks, over half of the standard OOD set falls in subsets where ID and OOD accuracy are inversely related.

Abstract

Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations (correlations that improve ID but reduce OOD performance) are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified…
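
The aggregation effect described in the abstract can be illustrated with a small synthetic sketch. This is not OODSelect and uses no data from the paper: the number of models, the group sizes, and the per-group accuracy curves below are all assumptions, chosen so that a minority of OOD examples gets harder as ID accuracy rises. The sketch only shows how a positive ID/OOD correlation on the full OOD set can coexist with a negative correlation on a subset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of models, described only by their ID accuracy.
n_models = 40
id_acc = np.linspace(0.60, 0.95, n_models)

# Hypothetical OOD set: a majority group and a minority group of examples.
n_majority, n_minority = 700, 300

# Assumed probability that a model is correct on each group.
# Majority: improves with ID accuracy. Minority: degrades with ID accuracy,
# as it would if better ID models leaned harder on a spurious feature.
t = (id_acc - id_acc.min()) / (id_acc.max() - id_acc.min())
p_majority = 0.30 + 0.60 * t
p_minority = 0.70 - 0.40 * t

# correct[m, i] is 1.0 if model m classifies OOD example i correctly.
correct = np.concatenate(
    [
        rng.random((n_models, n_majority)) < p_majority[:, None],
        rng.random((n_models, n_minority)) < p_minority[:, None],
    ],
    axis=1,
).astype(float)


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Per-model OOD accuracy on the full set and on the minority subset.
aggregate_acc = correct.mean(axis=1)
minority_acc = correct[:, n_majority:].mean(axis=1)

r_aggregate = pearson(id_acc, aggregate_acc)
r_minority = pearson(id_acc, minority_acc)

print(f"ID vs. aggregate OOD accuracy: r = {r_aggregate:.2f}")
print(f"ID vs. minority-subset OOD accuracy: r = {r_minority:.2f}")
```

In this sketch the subset membership is known by construction. The paper's setting is harder: the subset has to be discovered from the models' predictions, which is the role the abstract assigns to OODSelect.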
