Is Large-Scale Pretraining the Secret to Good Domain Generalization?
Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Bryan A. Plummer, Kate Saenko

TL;DR
This paper investigates whether large-scale pretraining is the key to domain generalization, introducing the Alignment Hypothesis that emphasizes the importance of embedding alignment for better performance on unseen domains.
Contribution
The paper proposes the Alignment Hypothesis, linking DG success to embedding alignment, and provides empirical analysis on existing methods using DomainBed datasets to validate this hypothesis.
Findings
DG methods perform poorly on Out-of-pretraining (OOP) data
Recent methods excel on In-pretraining (IP) data
Embedding alignment correlates with DG performance
Abstract
Multi-Source Domain Generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains. Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, and this has dramatically improved benchmark results. However, it remains unclear if DG finetuning methods are becoming better over time, or if improved benchmark performance is simply an artifact of stronger pre-training. Prior studies have shown that perceptual similarity to pre-training data correlates with zero-shot performance, but we find the effect limited in the DG setting. Instead, we posit that having perceptually similar data in pretraining is not enough; and that it is how well these data were learned that determines performance. This leads us to introduce the Alignment Hypothesis, which states…
Peer Reviews
Decision: ICLR 2025 Poster
The paper is well-written and generally easy to follow. This paper presents some interesting insights:
1. Relation of low AlignmentScore and presence of label noise in the domain generalization datasets.
2. "We find that all methods, including those considered state-of-the-art, perform poorly on OOP data" which seems to suggest that DG methods do not learn anything beyond what has already been learned by CLIP during pre-training. This also aligns with some of the previous findings.
3
The IP/OOP method of splitting the data seems a bit circular. Some of the results are difficult to interpret. For instance, Table 2 seems to show the correlation between final performance and prediction with IP/OOP splits, but the presentation makes it hard to understand. Further concerns are mentioned in the questions.
I appreciate that the authors consider the impact of pre-training weights in the current DG evaluation protocol. I also value their effort to remove noisy data labels in standard DG benchmarks. This paper presents a comprehensive set of experiments.
**Weaknesses**
- Since the goal of CLIP's contrastive loss is to maximize the similarity between the ground truth text label embedding and image embedding, it's unsurprising that the alignment score (measured as similarity to the ground truth text embedding) correlates with final performance after fine-tuning. Additionally, the alignment score applies only to vision-language models, not to pure vision models, which is a limitation that should be mentioned in the paper.
- When comparing the pre
The paper is well-written and considers an important question of understanding the importance of pre-training backbones in modern DG approaches. Creating a new benchmark is an important contribution.
The paper could be improved, especially in Section 3. The methods for identifying image similarity scores and alignment scores need clearer explanations instead of the brief sentences currently provided. I suggest including an algorithmic table, as this is a key contribution. While I appreciate the current contribution, it seems limited since it only considers one variant of the DG problem. The community has explored a broader range of generalization tasks including sub-population shifts. It w
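The reviews above describe the alignment score as the similarity between an image embedding and the ground-truth text label embedding, and refer to an IP/OOP split of the data. The following is a minimal sketch of that idea, not the paper's procedure: it assumes cosine similarity as the similarity measure, assumes the embeddings were already produced by a vision-language model such as CLIP, and assumes the split is a simple cutoff on the score. The function names, the toy embeddings, and the threshold value of 0.5 are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def alignment_scores(image_emb, text_emb, labels):
    """Cosine similarity between each image embedding and the text
    embedding of its ground-truth class.

    image_emb: (n_images, dim) image embeddings
    text_emb:  (n_classes, dim) text embeddings, one per class label
    labels:    (n_images,) integer ground-truth class indices
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    # Pick the ground-truth text embedding for every image, then take
    # the row-wise dot product of the unit vectors.
    return np.sum(img * txt[np.asarray(labels)], axis=-1)


def split_ip_oop(scores, threshold):
    """Assumed split rule: score >= threshold counts as In-pretraining
    (IP), everything else as Out-of-pretraining (OOP)."""
    ip_mask = np.asarray(scores) >= threshold
    return ip_mask, ~ip_mask


# Toy 2-D embeddings, made up purely for illustration.
image_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_emb = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 1]

scores = alignment_scores(image_emb, text_emb, labels)
ip_mask, oop_mask = split_ip_oop(scores, threshold=0.5)  # hypothetical cutoff
print(scores, ip_mask, oop_mask)
```

In this toy case the second image is orthogonal to its label's text embedding, so it is the only sample that falls on the OOP side of the assumed cutoff. Accuracy could then be reported separately on each mask to compare IP and OOP performance.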
Taxonomy
Topics: Cryptography and Residue Arithmetic · Adversarial Robustness in Machine Learning
