Rethinking Electro-Optical Vision Foundation Models for Remote Sensing Retrieval: A Controlled Comparison with Generalist VFM
Hyobin Park, Minseok Seo, Dong-Geol Choi

TL;DR
This study compares electro-optical (EO) and generalist vision foundation models for remote sensing image retrieval, finding that generalist models often perform as well as or better than EO-specific models, especially in cross-scene scenarios.
Contribution
It provides a controlled evaluation showing that generalist models are competitive with EO-specific models for remote sensing retrieval tasks.
Findings
Generalist models are often as effective as, or better than, EO-specific models.
EO-specific models degrade significantly under cross-scene evaluation.
Pretraining strategies for EO models need improvement to better utilize remote sensing data.
Abstract
Vision foundation models have attracted significant attention for their ability to leverage large-scale unlabeled visual data. This advantage is particularly important in remote sensing, where data acquisition is costly and annotation often requires expert knowledge. Recent electro-optical vision foundation models aim to learn domain-specific representations from remote sensing imagery, but it remains unclear whether they are more effective than strong generalist vision foundation models under retrieval-based evaluation. In this study, we conduct a controlled comparison between representative EO-specific and generalist vision foundation models for remote sensing image retrieval. Using the same datasets, retrieval protocol, and evaluation metric, we evaluate both in-domain performance and cross-scene generalization. Our results show that strong generalist vision foundation models are…
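The retrieval-based evaluation described in the abstract can be pictured with a minimal sketch: embed query and gallery images with a frozen model, rank the gallery by similarity, and score the ranking. The abstract as shown here does not name the metric or the similarity function, so cosine similarity and Recall@k are assumptions chosen for illustration. The function names and the toy embeddings and labels are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, k=1):
    """Fraction of queries with at least one same-label gallery item in the top k.

    Assumption: Recall@k stands in for whatever metric the paper actually uses.
    """
    if not query_emb:
        return 0.0
    hits = 0
    for q, q_label in zip(query_emb, query_labels):
        # Rank gallery indices from most to least similar to the query.
        ranked = sorted(
            range(len(gallery_emb)),
            key=lambda i: cosine(q, gallery_emb[i]),
            reverse=True,
        )
        if any(gallery_labels[i] == q_label for i in ranked[:k]):
            hits += 1
    return hits / len(query_emb)

# Toy 2-D "embeddings" standing in for features from a frozen backbone (hypothetical data).
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["forest", "forest", "harbor", "harbor"]
queries = [[0.8, 0.2], [0.2, 0.8]]
query_labels = ["forest", "harbor"]

score = recall_at_k(queries, query_labels, gallery, gallery_labels, k=1)
```

Under this sketch, an in-domain run would draw queries and gallery from the same dataset, while a cross-scene run would draw them from different datasets that share class labels. Holding the scoring function fixed while swapping only the embedding model is what keeps such a comparison controlled.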
