Multi-modal, multi-scale representation learning for satellite imagery analysis just needs a good ALiBi

Patrick Kage; Pavlos Andreadis

arXiv:2604.10347·cs.CV·April 14, 2026

TL;DR

This paper introduces Scale-ALiBi, a linear-bias transformer attention mechanism that improves multi-scale, multi-modal satellite imagery analysis by encoding relationships between image patches across different resolutions and modes.

Contribution

The paper proposes Scale-ALiBi, a novel attention mechanism for multi-resolution, multi-modal satellite imagery, along with an implementation evaluated on GEO-Bench and a newly curated, publicly released dataset of aligned optical and SAR imagery.

Findings

01

Improved performance on the GEO-Bench benchmark.

02

Implementation of Scale-ALiBi over aligned high- and low-resolution optical and low-resolution SAR satellite imagery.

03

Public release of a curated multi-modal satellite dataset.

Abstract

Vision foundation models have been shown to be effective at processing satellite imagery into representations fit for downstream tasks; however, creating models that operate over multiple spatial resolutions and modes is challenging. This paper presents Scale-ALiBi, a linear bias transformer attention mechanism whose spatial encoding bias captures relationships between image patches at different ground sample distance scales. We provide an implementation of Scale-ALiBi over a dataset of aligned high- and low-resolution optical and low-resolution SAR satellite imagery data using a triple-contrastive and reconstructive architecture, show an improvement on the GEO-Bench benchmark, and release the newly curated dataset publicly.
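
The abstract does not state the bias formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an ALiBi-style additive penalty that grows linearly with the ground distance (in meters) between patch centers, so that patches from grids with different ground sample distances (GSD) can be compared in one coordinate frame. The function names, the single-head `slope`, the `unit_m` normalization, and the example grid sizes and GSD values are all hypothetical.

```python
import numpy as np

def patch_centers_m(grid, gsd_m, patch_px):
    """Ground-coordinate centers (meters) of a grid x grid patch layout."""
    step = gsd_m * patch_px  # ground extent of one patch side, in meters
    idx = (np.arange(grid) + 0.5) * step
    ys, xs = np.meshgrid(idx, idx, indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1)  # (grid*grid, 2)

def linear_distance_bias(q_centers, k_centers, slope, unit_m):
    """Assumed ALiBi-style bias: -slope * ground distance, in units of unit_m."""
    diff = q_centers[:, None, :] - k_centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return -slope * dist / unit_m

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical setup: both grids cover the same 320 m x 320 m footprint.
hi_centers = patch_centers_m(grid=8, gsd_m=2.5, patch_px=16)   # 64 patches
lo_centers = patch_centers_m(grid=2, gsd_m=10.0, patch_px=16)  # 4 patches

rng = np.random.default_rng(0)
dim = 8
q = rng.normal(size=(hi_centers.shape[0], dim))  # high-res queries
k = rng.normal(size=(lo_centers.shape[0], dim))  # low-res keys
v = rng.normal(size=(lo_centers.shape[0], dim))  # low-res values

# Cross-scale bias: each high-res patch favors nearby low-res patches.
bias = linear_distance_bias(hi_centers, lo_centers, slope=0.5, unit_m=40.0)
out, weights = biased_attention(q, k, v, bias)
```

Expressing distances in meters rather than patch indices is the design choice this sketch is meant to illustrate: a one-patch offset means 40 m on the high-resolution grid but 160 m on the low-resolution one, and a ground-distance bias treats them accordingly. The original ALiBi uses a different fixed slope per attention head; a single slope is used here for brevity.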
