Specialized Foundation Models Struggle to Beat Supervised Baselines
Zongzhe Xu, Ritvik Gupta, Wenduo Cheng, Alexander Shen, Junhong Shen, Ameet Talwalkar, Mikhail Khodak

TL;DR
This paper shows that in specialized scientific domains like genomics, satellite imaging, and time series, simple supervised models often outperform large foundation models, highlighting the need for strong baselines.
Contribution
The paper introduces two automated supervised workflows, DASHA and Auto-AR, for fair comparison between foundation models and supervised baselines in specialized fields.
Findings
Simple supervised models match or outperform foundation models in tested domains.
Large-scale pretraining benefits are not yet realized in many specialized areas.
The authors provide open-source tools for benchmarking foundation models against strong supervised baselines.
Abstract
Following its success for vision and text, the "foundation model" (FM) paradigm -- pretraining large models on massive data, then fine-tuning on target tasks -- has rapidly expanded to domains in the sciences, engineering, healthcare, and beyond. Has this achieved what the original FMs accomplished, i.e. the supplanting of traditional supervised learning in their domains? To answer we look at three modalities -- genomics, satellite imaging, and time series -- with multiple recent FMs and compare them to a standard supervised learning workflow: model development, hyperparameter tuning, and training, all using only data from the target task. Across these three specialized domains, we find that it is consistently possible to train simple supervised models -- no more complicated than a lightly modified wide ResNet or UNet -- that match or even outperform the latest foundation models. Our…
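As a rough illustration of the "standard supervised learning workflow" the abstract describes (model development, hyperparameter tuning, and training on target-task data only), the sketch below fits a linear autoregressive forecaster and picks its lag order on a held-out validation split. This is not the paper's Auto-AR or DASHA pipeline: the function names (`make_lagged`, `fit_ar`, `tune_lag`), the synthetic sine series, and the candidate lag values are all assumptions made for this example.

```python
import numpy as np


def make_lagged(series, p):
    """Build (X, y): each row of X holds the p values preceding its target in y."""
    n = len(series)
    X = np.stack([series[i:n - p + i] for i in range(p)], axis=1)
    y = series[p:]
    return X, y


def fit_ar(series, p):
    """Least-squares fit of a linear AR(p) model with an intercept term."""
    X, y = make_lagged(series, p)
    X = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def one_step_mse(context, coef):
    """Mean squared error of one-step-ahead predictions over `context`."""
    p = len(coef) - 1
    X, y = make_lagged(context, p)
    X = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((X @ coef - y) ** 2))


def tune_lag(train, val, candidate_lags):
    """Pick the lag order with the lowest validation error (hypothetical helper)."""
    best_p, best_mse = None, np.inf
    for p in candidate_lags:
        coef = fit_ar(train, p)
        # Prepend the last p training points so every validation target has full history.
        context = np.concatenate([train[-p:], val])
        mse = one_step_mse(context, coef)
        if mse < best_mse:
            best_p, best_mse = p, mse
    return best_p, best_mse


# Synthetic target-task data (an assumption for this sketch): a noisy sine wave.
rng = np.random.default_rng(0)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(len(t))
train, val = series[:450], series[450:]

candidate_lags = [1, 2, 4, 8, 24]
best_p, best_mse = tune_lag(train, val, candidate_lags)
print(f"selected lag order: {best_p}, validation MSE: {best_mse:.4f}")
```

The point of the sketch is only the shape of the loop: every modelling decision is made from the target task's own train and validation splits, with no pretrained weights involved.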
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper is well-organized and well-written. The research questions are both interesting and practical, particularly in highlighting that some foundation models (FMs) are not fully comparable to in-domain, supervised baseline models. The experimental setup and results are comprehensive, effectively supporting the study’s objectives and providing valuable insights into the performance of FMs relative to tailored supervised models. 2. The two proposed automated supervised pipelines, DASHA an
1. The strong performance of supervised methods through the automated pipeline may be largely attributed to the large-scale training data. As noted in the data statistics in the appendix, nearly all datasets contain over 10,000 samples for training, which is sufficient for effective supervised training from scratch, even without pretrained checkpoints. For a comprehensive evaluation and to support the claim that pure supervised learning can outperform pretraining, additional experiments varying
1. The benchmarks and datasets chosen for each domain appear to comprehensively cover relevant applications in genomics, satellite imaging, and time series. 2. The paper highlights computational efficiency as a priority by contrasting the costs associated with FMs versus the supervised baselines. Holds high relevance given the fast adoption of large, resource-intensive models in research and industry. 3. Contextualized the results within each domain, providing a nuanced understanding of the mo
1. The paper lacks an exploration of their applicability to other types of data or more complex, multi-modal tasks. A broader discussion of potential limitations could help future researchers understand the contexts in which DASHA and Auto-AR might or might not succeed. 2. The choice of baselines (e.g., Wide ResNet and UNet for genomics and satellite imaging) is reasonable but could be expanded. Although the selected baselines demonstrate competitive performance, further comparisons with more r
1) The paper addresses a highly relevant topic that warrants investigation, given the growing popularity of FMs and the ongoing efforts within the community to adapt these models to specialized domains beyond natural images and text. 2) The results are surprising and critical to keep a fair benchmarking of future FMs against other pipelines. The actual gains in data efficiency scored by the proposed baselines (nice summary in Figure 1) range from three to five orders of magnitude in the three t
1) The paper presents two "solutions" from an architectural standpoint, i.e., CNNs for genomics/satellite data and auto-AR for time series. This leads to two questions: - What model should a practitioner use when facing a task in a different specialized domain? - How come 1D CNNs (or, e.g., RNNs) fail to match the performance of FMs in time series forecasting? Is it just a matter of domain knowledge and data type? In practice, the paper would benefit from some more convincing justification an
Taxonomy
Topics: Grouting, Rheology, and Soil Mechanics · Dam Engineering and Safety · Geotechnical Engineering and Underground Structures
Methods: Average Pooling · Max Pooling · Global Average Pooling · Convolution · Kaiming Initialization
