Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi
Patrick Kage, Pavlos Andreadis

TL;DR
This paper introduces Scale-ALiBi, a linear bias transformer attention mechanism that enhances multi-scale, multi-modal satellite imagery analysis by encoding relationships across different resolutions and modes.
Contribution
The paper proposes Scale-ALiBi, a novel attention mechanism for multi-resolution, multi-modal satellite imagery, along with an implementation and a new dataset for benchmarking.
Findings
Improved performance on the GEO-Bench benchmark.
Implementation of Scale-ALiBi over diverse satellite data.
Public release of a curated multi-modal satellite dataset.
Abstract
Vision foundation models have been shown to be effective at processing satellite imagery into representations fit for downstream tasks; however, creating models that operate over multiple spatial resolutions and modes is challenging. This paper presents Scale-ALiBi, a linear bias transformer attention mechanism that applies a spatial encoding bias to relationships between image patches at different ground sample distance scales. We provide an implementation of Scale-ALiBi over a dataset of aligned high- and low-resolution optical and low-resolution SAR satellite imagery using a triple-contrastive and reconstructive architecture, show an improvement on the GEO-Bench benchmark, and release the newly curated dataset publicly.
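To make the idea of a linear attention bias across ground sample distances more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact bias formula, so this assumes the bias is the negative Euclidean distance between patch centres measured in ground metres, scaled per head by the geometric slopes from the original ALiBi formulation. The function names (`patch_centres`, `alibi_slopes`, `scale_aware_bias`, `attention`) and the `unit_m` normalisation length are hypothetical and chosen only for illustration.

```python
import numpy as np

def patch_centres(grid: int, gsd_m: float, patch_px: int) -> np.ndarray:
    """Ground coordinates (metres) of patch centres for a grid x grid image.

    gsd_m is the ground sample distance (metres per pixel); each patch
    covers patch_px x patch_px pixels.
    """
    side_m = gsd_m * patch_px
    idx = (np.arange(grid) + 0.5) * side_m
    ys, xs = np.meshgrid(idx, idx, indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=-1)  # (grid*grid, 2)

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Per-head slopes 2^(-8h/n_heads) for h = 1..n_heads, as in ALiBi."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def scale_aware_bias(q_centres, k_centres, n_heads, unit_m):
    """ASSUMED bias: -slope * ground distance (in units of unit_m)."""
    diff = q_centres[:, None, :] - k_centres[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) / unit_m        # (Nq, Nk)
    return -alibi_slopes(n_heads)[:, None, None] * dist[None]  # (H, Nq, Nk)

def attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the scores."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1]) + bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Two images over the same 320 m footprint at different resolutions
# (illustrative numbers, not taken from the paper).
hi = patch_centres(grid=8, gsd_m=2.5, patch_px=16)   # 64 patches, 40 m each
lo = patch_centres(grid=2, gsd_m=10.0, patch_px=16)  # 4 patches, 160 m each
bias = scale_aware_bias(hi, lo, n_heads=4, unit_m=40.0)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 64, 8))
k = rng.normal(size=(4, 4, 8))
v = rng.normal(size=(4, 4, 8))
out = attention(q, k, v, bias)
```

Because the distances are computed in ground metres rather than in patch indices, high-resolution queries attend more strongly to the low-resolution patches that physically overlap them, which is one plausible reading of a bias that relates patches across scales.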
