TL;DR
This study systematically examines how the geographic and spectral diversity of pretraining datasets influences geospatial model performance. It identifies spectral diversity as a key factor and releases new datasets and models.
Contribution
The paper presents a systematic analysis of how pretraining data diversity affects downstream performance, highlights the importance of spectral diversity, and releases new datasets and models.
Findings
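Pretraining on European data outperforms pretraining on global and other continent-specific datasets on both global and local downstream tasks.
Spectral diversity of the pretraining data is strongly correlated with downstream performance.
Other diversity measures, such as continent, biome, and land cover, are only weakly correlated with downstream performance.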
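As a rough illustration of what "correlated with downstream performance" means here, the sketch below computes a Pearson correlation coefficient between a per-dataset diversity score and a per-dataset downstream score. The summary does not say which correlation statistic or diversity metric the paper uses, so the choice of Pearson, the function name `pearson`, and all numbers are assumptions made for illustration; the values are placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values only (hypothetical, NOT taken from the paper):
# one diversity score and one downstream score per pretraining dataset.
diversity_scores = [0.10, 0.20, 0.30, 0.40]
downstream_scores = [0.50, 0.55, 0.65, 0.70]

r = pearson(diversity_scores, downstream_scores)
print(f"Pearson r = {r:.3f}")
```

A value of `r` near 1 or -1 would indicate a strong linear relationship and a value near 0 a weak one. With only a handful of pretraining datasets, any such coefficient should be read cautiously.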
Abstract
New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across…
