Unraveling the Key Components of OOD Generalization via Diversification
Harold Benoit, Liangze Jiang, Andrei Atanov, Oğuzhan Fatih Kar, Mattia Rigotti, Amir Zamir

TL;DR
This paper investigates diversification methods for out-of-distribution (OOD) generalization. It shows that these methods are sensitive to the distribution of the unlabeled data, that their success depends on the choice of learning algorithm, and that increasing the number of hypotheses does not remove these limitations.
Contribution
The study identifies key factors affecting diversification methods' effectiveness for OOD generalization, emphasizing the roles of data distribution and learning algorithms.
Findings
Diversification methods are sensitive to the distribution of the unlabeled data used for diversification.
Algorithm choice critically impacts OOD performance.
Increasing the number of hypotheses does not mitigate the identified pitfalls.
Abstract
Supervised learning datasets may contain multiple cues that explain the training set equally well, i.e., learning any of them would lead to the correct predictions on the training data. However, many of them can be spurious, i.e., lose their predictive power under a distribution shift and consequently fail to generalize to out-of-distribution (OOD) data. Recently developed "diversification" methods (Lee et al., 2023; Pagliardini et al., 2023) approach this problem by finding multiple diverse hypotheses that rely on different features. This paper aims to study this class of methods and identify the key components contributing to their OOD generalization abilities. We show that (1) diversification methods are highly sensitive to the distribution of the unlabeled data used for diversification and can underperform significantly when away from a method-specific sweet spot. (2)…
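To make the idea of "finding multiple diverse hypotheses" concrete, the following is a minimal sketch of a diversification objective for binary classification: a second hypothesis is fit to the labeled data while being encouraged to disagree with a first hypothesis on unlabeled data. This is an illustration under stated assumptions, not the exact loss of DivDis or D-BAT; the function names, the `alpha` weight, and the specific form of the disagreement term are all chosen here for illustration.

```python
import numpy as np


def cross_entropy(p, y, eps=1e-7):
    """Binary cross-entropy between predicted probabilities p and labels y."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(-(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))))


def disagreement_loss(p1, p2, eps=1e-7):
    """Penalty that is small when two binary classifiers disagree.

    p1, p2: predicted probabilities of class 1 on the same unlabeled inputs.
    (Illustrative form; an assumption, not taken from the source.)
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Probability that the two predicted labels differ when each label is
    # sampled independently from its classifier's output distribution.
    p_disagree = p1 * (1.0 - p2) + (1.0 - p1) * p2
    return float(np.mean(-np.log(p_disagree + eps)))


def diversification_objective(p2_labeled, y_labeled, p1_unlabeled, p2_unlabeled, alpha=1.0):
    """Fit the labeled data while disagreeing with hypothesis 1 on unlabeled data.

    alpha (hypothetical hyperparameter) trades off the two terms.
    """
    fit = cross_entropy(p2_labeled, y_labeled)
    diversity = disagreement_loss(p1_unlabeled, p2_unlabeled)
    return fit + alpha * diversity
```

In this sketch, the unlabeled inputs fed to `disagreement_loss` are where the sensitivity discussed in the abstract would enter: the same two hypotheses can receive very different diversity penalties depending on which unlabeled distribution they are compared on.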
Peer Reviews
Decision: ICLR 2024 poster
This paper really dives into the intricacies of the diverse hypothesis generation problem and does a wonderful job illustrating how complex the problem truly is; that is, success simultaneously depends on all variables. In my opinion, this message should be communicated more frequently in conference proceedings. In particular, proposition #1 is quite illuminating in that it shows how DivDis and DBAT select different diverse hypotheses from one another, and there are different regimes defined
This paper has a number of weaknesses. I found the presentation more confusing and dense than it could be: 1. In particular, there is some terminology and notation that can be improved for greater understanding. The "spurious ratio" index is a poorly named quantity because it's literally the accuracy of the selected hypothesis with respect to the true hypothesis h*. Namely, h* has the maximum spurious ratio value of 1.0, but it's definitely NOT spurious as it's the true hypothesis. Another nam
I really like this paper. Given so many methods being proposed for OOD generalization, it is important to take a step back and analyze which ones are likely to work and under what conditions. This paper finds that the literature on diversification of hypotheses may not be conceptually well-motivated. The key result is that the success of this technique depends on inductive bias of the model architecture and the same architecture may not work well for different kinds of test set. In hindsight
While the analysis is compelling, I'm wondering whether these limitations matter in practice. What if we do model selection over multiple architectures and multiple diversity algorithms? Is the risk that the results we get on a cross-validation set may not generalize to the test set? If so, what are the summary statistics of the test set that the above procedure would need to know? For example, if the spurious ratio of the (unseen) test set is known, can that be used to simulate a pseudo-test s
The work is original to the best of my knowledge. The paper is well written and clear. I have a few suggested typo-style edits in the weaknesses section, but the writing is certainly strong. I did not notice any errors or incorrect conclusions in the findings. Although I didn't have time to go through every result in extreme detail, I have a reasonable amount of confidence in the correctness of the results, generally. Generalization which is robust to out-of-distribution shifts is certainl
My main concern with the paper is whether its results are significant enough to merit acceptance at ICLR. I'm open to being persuaded that the paper is significant enough, but that's not clear to me, for a few different reasons. I offer these concerns with only moderate confidence, since I am not an expert specifically on distributional shift literature. The paper focuses largely on the pitfalls and limitations of 2 papers from ICLR 2023, the Lee DivDis and the Pagliardini D-Bat approaches. I
Taxonomy
Topics: Machine Learning and Data Classification · Data Stream Mining Techniques · Anomaly Detection Techniques and Applications
Methods: Sparse Evolutionary Training
