Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

Amandeep Kaur; Mirali Purohit; Gedeon Muhawenayo; Esther Rolf; Hannah Kerner

arXiv:2604.21104·cs.CV·April 24, 2026

TL;DR

This study systematically examines how the geographic and spectral diversity of pretraining datasets influences geospatial foundation model performance. It identifies spectral diversity as a key factor and releases new datasets and models.

Contribution

The paper presents a systematic analysis of how pretraining data diversity affects downstream performance, highlights the importance of spectral diversity, and releases new datasets and models.

Findings

01

Pretraining on data from Europe outperforms pretraining on global and other continent-specific datasets in both global and local downstream evaluations.

02

Spectral diversity is strongly correlated with downstream performance.

03

Other diversity measures, such as continent, biome, and land cover, are only weakly correlated with downstream performance.
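
The correlation findings above can be illustrated with a minimal sketch of how a diversity score might be compared against downstream performance across pretraining datasets. This listing does not say which correlation statistic or diversity metric the authors used, so the choice of Spearman rank correlation, the helper names (`ranks`, `pearson`, `spearman`), and every number below are assumptions made for illustration only.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic placeholder values, NOT results from the paper.
# One entry per hypothetical pretraining dataset.
diversity_score = [0.12, 0.35, 0.28, 0.50, 0.41]
downstream_score = [0.61, 0.70, 0.66, 0.78, 0.72]

rho = spearman(diversity_score, downstream_score)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based statistic is used here because it only assumes a monotonic relationship between the diversity measure and performance; with real data, the same function could be applied once per diversity measure (spectral, continent, biome, land cover) to compare their strengths.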

Abstract

New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across…

Code & Models

Repositories

kerner-lab/pretrain-where (GitHub)