Ask Your Distribution Shift if Pre-Training is Right for You
Benjamin Cohen-Wang, Joshua Vendrow, Aleksander Madry

TL;DR
This paper investigates when pre-training helps with distribution shift robustness, finding it mitigates poor extrapolation but not dataset biases, and explores implications for model development and fine-tuning strategies.
Contribution
It characterizes the failure modes pre-training can address, providing theoretical and empirical evidence that pre-training helps with extrapolation but not biases.
Findings
Pre-training mitigates poor extrapolation under distribution shift.
Pre-training does not effectively address dataset biases.
Fine-tuning on a small, de-biased dataset can outperform fine-tuning on a large, biased one.
Abstract
Pre-training is a widely used approach to develop models that are robust to distribution shifts. However, in practice, its effectiveness varies: fine-tuning a pre-trained model improves robustness significantly in some cases but not at all in others (compared to training from scratch). In this work, we seek to characterize the failure modes that pre-training can and cannot address. In particular, we focus on two possible failure modes of models under distribution shift: poor extrapolation (e.g., they cannot generalize to a different domain) and biases in the training data (e.g., they rely on spurious features). Our study suggests that, as a rule of thumb, pre-training can help mitigate poor extrapolation but not dataset biases. After providing theoretical motivation and empirical evidence for this finding, we explore two of its implications for developing robust models: (1) pre-training…
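The extrapolation failure mode described above can be illustrated with a toy logistic regression. The sketch below is not the paper's experimental setup: the data, dimensions, and the names `train_logreg`, `w_pre_init`, and `off_finetuned` are assumptions made for illustration. It relies only on the standard fact that gradient-descent updates for logistic regression lie in the span of the training inputs, so the weight components outside that span stay at their initial values. Under this assumption, the training data fixes the model's behavior inside the subspace it covers, while the initialization (random, zero, or pre-trained) decides how the model behaves outside it.

```python
import numpy as np

def train_logreg(X, y, w_init, lr=0.1, steps=500):
    """Plain gradient descent on the mean logistic loss, no regularization."""
    w = w_init.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # update lies in the span of the rows of X
    return w

rng = np.random.default_rng(0)
d = 5

# Hypothetical training set: inputs vary only along the first two coordinates,
# so the remaining three coordinates are never observed during training.
X = np.zeros((200, d))
X[:, :2] = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w_scratch = train_logreg(X, y, np.zeros(d))    # "from scratch": zero initialization
w_pre_init = rng.normal(size=d)                # stand-in for a pre-trained initialization
w_finetuned = train_logreg(X, y, w_pre_init)   # "fine-tuned" from that initialization

# Weight components outside the training subspace are untouched by training.
off_scratch = w_scratch[2:]
off_finetuned = w_finetuned[2:]
print(off_scratch)
print(off_finetuned - w_pre_init[2:])
```

Both printed vectors are zero: neither run changes the weights outside the training subspace, so any test-time shift along those directions is handled entirely by the initialization. This toy says nothing about dataset biases, where the problematic signal is present in the training data itself.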
Peer Reviews
Decision·Submitted to ICLR 2024
The paper is easy to follow and for the most part well-written. While the paper does not propose novel methods for robustness, it leads an important discussion on the interplay of pre-trained networks with training methods for group-robustness. It provides a novel characterization of distribution shifts and it argues about the complementary utility of the two approaches around this characterization, with some theoretical and empirical insights. The reviewer believes that such an analysis pa
While their conclusion might not be incorrect, the design of several experiments is flawed and does not lead to the authors’ claims. The reviewer thinks that these constitute a large enough body of the paper to lean towards possible acceptance. In particular: 1. In **Section 4/Figure 3**, there are multiple variables which are being ablated at the same time. The measured effective robustness is with respect to the performance of ResNet18 models, however the pretrained models, which are compared to,
- The paper addresses the very interesting topic of exploring whether pre-training helps training performant models under different types of distribution shift. This is an important frontier in the increasingly adopted pre-training fine-tuning training methodology.
- The ideas presented seem original and the relevant literature is sufficiently discussed.
- The later sections on developing more robust models were interesting case studies on how to apply the insights from previous sections.
- Although more details are presented in the Appendix, I don't think that the setup for Theorem 3.1 is presented well in the paper. It is not clear why the logistic regression assumption is needed and the proof is not referenced in the main paper. It is also unclear what $proj_{W_{ref}}$ refers to as it is never formally introduced. Is orthogonal to be interpreted mathematically or figuratively? Also "[...] while the initialization determines how the model extends outside of $W_{ref}$": since $W
The problem being investigated is definitely of interest. With the emergence of foundation models, it is of growing interest to better understand the impact of pre-training data and its implications for downstream processes. This paper is well-contextualized. Its technical structure is plausible (intuitions, motivating examples, formal analysis, generalization, empirical verification, etc.).
This paper is a difficult read in general. I was quite attracted by the topic of this paper and had high hopes until Theorem 1, which I could not understand after several attempts. **Many technical details are missing or inconsistent (e.g., missing key definitions, no details for important procedures, only providing references with no description at all), rendering many arguments ungrounded and hardly convincing.** - Theorem 1, key error: $w_{ref}$ is undefined, which is a crucial variable. With
