In Search of Forgotten Domain Generalization
Prasanna Mayilvahanan, Roland S. Zimmermann, Thaddäus Wiedemer, Evgenia Rusak, Attila Juhos, Matthias Bethge, Wieland Brendel

TL;DR
This paper investigates the true out-of-domain generalization capabilities of models trained on large-scale web datasets, revealing that apparent robustness is often due to in-domain data and proposing new datasets for better evaluation.
Contribution
It introduces large-scale, style-distinct datasets from LAION for rigorous OOD evaluation and systematically explores data mixing strategies to enhance model robustness.
Findings
A significant portion of the performance of models trained on web-scale data is explained by in-domain examples.
Mixing natural and rendition data at well-chosen ratios improves cross-domain generalization.
Web-scale training does not inherently guarantee true OOD robustness.
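To make the idea of a data mixing ratio concrete, here is a minimal Python sketch that draws a fixed-size training subset from two style pools at a given rendition-to-natural ratio. The function name `mix_by_ratio`, the toy pools, and the uniform sampling scheme are illustrative assumptions, not the authors' pipeline.

```python
import random


def mix_by_ratio(rendition, natural, ratio, total, seed=0):
    """Sample `total` items so that rendition:natural follows `ratio`.

    `ratio` is a pair such as (1, 3), read as rendition-to-natural.
    Hypothetical helper for illustration only.
    """
    r, n = ratio
    if r < 0 or n < 0 or r + n == 0:
        raise ValueError("ratio must contain non-negative parts with a positive sum")
    # Number of rendition samples; the remainder is filled with natural samples
    # so the subset size is always exactly `total`.
    n_rend = round(total * r / (r + n))
    n_nat = total - n_rend
    if n_rend > len(rendition) or n_nat > len(natural):
        raise ValueError("a pool is too small for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    mixed = rng.sample(rendition, n_rend) + rng.sample(natural, n_nat)
    rng.shuffle(mixed)
    return mixed


# Toy pools standing in for style-pure subsets (assumed data, not LAION).
rendition_pool = [("rendition", i) for i in range(100)]
natural_pool = [("natural", i) for i in range(100)]

subset = mix_by_ratio(rendition_pool, natural_pool, ratio=(1, 3), total=40)
n_rendition = sum(1 for style, _ in subset if style == "rendition")
print(n_rendition, len(subset) - n_rendition)
```

Keeping the total fixed while varying only the ratio separates the effect of the domain mix from the effect of dataset size, which is the comparison a ratio sweep needs.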
Abstract
Out-of-Domain (OOD) generalization is the ability of a model trained on one or more domains to generalize to unseen domains. In the ImageNet era of computer vision, evaluation sets for measuring a model's OOD performance were designed to be strictly OOD with respect to style. However, the emergence of foundation models and expansive web-scale datasets has obfuscated this evaluation process, as datasets cover a broad range of domains and risk test domain contamination. In search of the forgotten domain generalization, we create large-scale datasets subsampled from LAION -- LAION-Natural and LAION-Rendition -- that are strictly OOD to corresponding ImageNet and DomainNet test sets in terms of style. Training CLIP models on these datasets reveals that a significant portion of their performance is explained by in-domain examples. This indicates that the OOD generalization challenges from…
Peer Reviews
Decision·ICLR 2025 Spotlight
1. **Extremely important problem**: This paper addresses a critical issue in machine learning, distinguishing true OOD generalization from data contamination, by revisiting fundamental challenges of OOD generalization in the context of large-scale vision-language models. Their work highlights the limitations of current evaluation practices, underscoring the importance of ensuring genuine robustness and generalization capabilities. 2. **Impa
1. **Missing Literature:** The field of OOD (out of distribution) generalization is very rich, and there are several related papers that are not cited. It would be nice to have this work related to existing works in the field (See below) ### References a. Liu, J., Shen, Z., He, Y., Zhang, X., Xu, R., Yu, H. and Cui, P., 2021. *Towards out-of-distribution generalization: A survey*. arXiv preprint arXiv:2108.13624. b. Koh, P.W., Sagawa, S., Marklund, H., Xie, S.M., Zhang, M., Balsubramani, A.,
- S1: This work raises awareness of the importance of curating and performing in-depth analysis of large-scale datasets. - S2: One of the main contributions of this work is to release two curated partitions of the LAION dataset with non-overlapping domains. This will promote and facilitate future research on single-source domain generalization for foundation models such as CLIP.
- W1: The motivation for the main question this work aims at answering ("shed light on the limitations of foundation models like CLIP in handling OOD generalization") is not quite clear from the manuscript. In a few words, it seems this work is showing that when CLIP is trained and tested on the same distribution, it performs well, and performance is harmed once a particular domain is removed from the training set and the model is tested on it. What is the exact new insight from this result? Ple
1. This paper studies a valuable problem that has seldom been studied. Indeed, the impression that OOD generalization is less of a problem in the era of LLMs might be an illusion created by scaling to massive data. This problem should be taken seriously, as there are still areas where collecting data is extremely difficult and scaling will likely fail. 2. Sound analysis that supports the claim.
1. Figure 6A is a bit confusing, as the results for the best rendition-to-natural ratios, 1:3 and 1:1, cannot be read from this figure. I suggest adding ratio labels to the color scale or to individual data points in the figure. 2. Can you provide some discussion on the "true effectiveness" of the domain classifier, which is trained and evaluated on the curated domain datasets but applied to a different and much larger dataset, i.e. LAION-200M? * Will there be OOD shift between the curated domain datasets and
Taxonomy
Topics: Reservoir Engineering and Simulation Methods
Methods: Contrastive Language-Image Pre-training
