TL;DR
This paper presents a foundation model for high-resolution satellite imagery of the Netherlands that combines a CNN with a Vision Transformer and uses temporal input to improve land-cover understanding. It also reports competitive performance on global benchmarks.
Contribution
The study introduces a novel multi-modal model that captures both spatial and temporal features, achieving competitive results with fewer parameters and limited data.
Findings
Temporal input improves vegetation-monitoring accuracy.
Achieves competitive performance on global benchmarks.
Uses fewer parameters and less data than state-of-the-art models.
Abstract
We develop a foundation model using 1.2m high-resolution satellite images of the Netherlands. By combining a Convolutional Neural Network and a Vision Transformer, the model captures both high- and low-frequency landscape features: fine textures, edges, and small objects on the one hand, and large terrain structures, elevation patterns, and land-cover distributions on the other. Taking temporal data as input, the model learns from broader contextual information across time, which lets it exploit temporal dependencies such as topographic features, land-cover changes, and seasonal dynamics. These additional constraints reduce feature ambiguity, improve representation learning, and enable better generalization with fewer labeled samples. The foundation model is evaluated on multiple downstream tasks, ranging from use cases within the Netherlands to global benchmarking datasets. On the…
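The abstract's idea of pairing a convolutional branch (local, high-frequency detail) with an attention branch (global, low-frequency context) over a time series can be illustrated with a toy sketch. This is not the paper's architecture: the function names, the fixed Laplacian kernel, the identity attention projections, and the fusion by concatenation are all assumptions made for illustration, and nothing here is learned.

```python
import math

def conv3x3(img, kernel):
    """Valid 3x3 cross-correlation on a 2-D list (local, CNN-style branch)."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[i][j] * img[y + i][x + j]
                 for i in range(3) for j in range(3))
             for x in range(w - 2)]
            for y in range(h - 2)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (global, ViT-style branch)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(wt * v[c] for wt, v in zip(weights, tokens))
                    for c in range(d)])
    return out

def patch_tokens(series, patch):
    """One token per non-overlapping patch: its mean value at every time step."""
    h, w = len(series[0]), len(series[0][0])
    tokens = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            tokens.append([
                sum(frame[y][x]
                    for y in range(py, py + patch)
                    for x in range(px, px + patch)) / patch ** 2
                for frame in series])
    return tokens

def hybrid_features(series, patch=2):
    """Concatenate per-frame edge strength (local) with attended patch means (global).

    `series` is a list of T frames, each an H x W nested list; H and W are
    assumed to be multiples of `patch` and at least 3.
    """
    laplacian = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]  # hypothetical fixed edge kernel
    local = []
    for frame in series:
        flat = [abs(v) for row in conv3x3(frame, laplacian) for v in row]
        local.append(sum(flat) / len(flat))
    attended = self_attention(patch_tokens(series, patch))
    global_ctx = [sum(tok[t] for tok in attended) / len(attended)
                  for t in range(len(series))]
    return local + global_ctx

# Three time steps of a 4x4 scene with a vertical edge between columns 1 and 2.
edge_frame = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
features = hybrid_features([edge_frame, edge_frame, edge_frame])
```

The first T entries of the result respond only to local intensity changes, while the last T entries summarize every patch in the context of all others across time. A trained model would replace both fixed operations with learned layers.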
