Reassessing the Validity of Spurious Correlations Benchmarks
Samuel J. Bell, Diane Bouchacourt, Levent Sagun

TL;DR
This paper critically evaluates existing spurious correlations benchmarks for neural networks, revealing inconsistencies and proposing criteria for more meaningful evaluation to improve robustness assessments.
Contribution
It introduces three desiderata for benchmark validity and offers practical guidance for selecting appropriate mitigation methods based on benchmark similarity.
Findings
Significant disagreement among existing benchmarks.
Some benchmarks do not reliably measure method performance.
Certain mitigation methods lack sufficient robustness.
Abstract
Neural networks can fail when the data contains spurious correlations. To understand this phenomenon, researchers have proposed numerous spurious correlations benchmarks upon which to evaluate mitigation methods. However, we observe that these benchmarks exhibit substantial disagreement, with the best methods on one benchmark performing poorly on another. We explore this disagreement, and examine benchmark validity by defining three desiderata that a benchmark should satisfy in order to meaningfully evaluate methods. Our results have implications for both benchmarks and mitigations: we find that certain benchmarks are not meaningful measures of method performance, and that several methods are not sufficiently robust for widespread use. We present a simple recipe for practitioners to choose methods using the most similar benchmark to their given problem.
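One way to make the disagreement described in the abstract concrete is to compare how two benchmarks rank the same set of methods. The sketch below computes Kendall's tau over per-method scores. It is an illustration only, not the authors' analysis: the method names and scores are invented, and the choice of Kendall's tau as the agreement measure is an assumption.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same methods.

    Returns +1.0 when both benchmarks rank the methods identically and
    -1.0 when the rankings are exactly reversed. No tie correction.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both benchmarks must score the same methods")
    if len(scores_a) < 2:
        raise ValueError("need at least two methods to compare rankings")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # Positive product: both benchmarks order methods i and j the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, invented for illustration (not results from the paper).
methods = ["method_a", "method_b", "method_c", "method_d"]
benchmark_x = [0.70, 0.80, 0.75, 0.90]
benchmark_y = [0.85, 0.60, 0.80, 0.65]

tau = kendall_tau(benchmark_x, benchmark_y)
print(f"Rank agreement between benchmarks: {tau:.2f}")
```

A strongly negative value, as in this made-up example, would mean the best methods on one benchmark tend to be the worst on the other.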
Peer Reviews
Decision·Submitted to ICLR 2025
- **Originality**: The paper presents a novel approach to evaluating spurious correlation benchmarks by proposing three validity criteria: ERM Failure, Discriminative Power, and Convergent Validity.
- **Quality**: The study is well-executed, with a thorough empirical analysis to assess the proposed validity criteria. The use of the Bayes Factor as a measure of task difficulty provides a quantifiable metric, helping to identify benchmark inconsistencies.
- **Clarity**: Definitions of key concepts,
The methods discussed in the paper currently omit some recent state-of-the-art algorithms and techniques in spurious correlation research published before July 1, 2024, which would strengthen both the related work and Section 4. For instance,
- Wang et al. "On the Effect of Key Factors in Spurious Correlation." AISTATS 2024.
- Yang et al. "Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias." AISTATS 2024.
- Lin et al. "Spurious Feature Diversification Improves Out
The results provide insights both about which benchmarks are good indicators of mitigating spurious correlations and which methods are robust across different benchmarks, which can be useful to a variety of practitioners.
1. Calculating K requires two full training runs (one with ERM, one with reweighting). This is extremely resource-intensive, and the empirical results do not seem to show a significant enough improvement to warrant such a cost.
2. The variety of spurious correlation benchmarks is a problem that has been addressed in previous work (Joshi et al., 2023; Yang et al., 2023). A more detailed comparison of the observations in this work versus those in previous work would be appreciated.
3. Some parts
The paper's main finding, that Spurious Correlation Benchmarks (SCBs) often disagree with one another, is definitely of interest. The experiments performed are sound and presented in a clear manner. The prose is of good quality, and in general the paper is easy to read and understand.
The biggest weakness of the paper is that it is incorrectly titled, and the abstract is misleading. Spurious correlations can be present outside of datasets with subpopulation shifts; however, only subpopulation-shift datasets and approaches have been considered in this work. While the authors note this in the discussion section, I find this to still be insufficient. In its current state I think the work would be much better titled “REASSESSING THE VALIDITY OF SUBPOPULATION SHIFT BENCHMARKS”. With this
Taxonomy
Topics: Cognitive and psychological constructs research · Evaluation and Performance Assessment
