Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

Song Zhang; Yanlong Chen; Yilin Li; Yining Chen; Zili Yi; Xiaowei Zhang; and Yawei Li

arXiv:2605.07562 · cs.CV · May 11, 2026

TL;DR

This paper introduces ScaleEarth, a parameter-efficient fine-tuning framework for remote sensing vision-language models that treats ground sampling distance (GSD) as a continuous conditioning variable rather than a discrete text token, improving performance across diverse Earth-system tasks.

Contribution

The paper proposes CS-HLoRA, a continuous scale-conditioning method, and a new scale-aware dataset, GeoScale-VQA, advancing remote sensing VLMs beyond discrete GSD tokens.

Findings

01. ScaleEarth achieves state-of-the-art results on remote sensing benchmarks.

02. CS-HLoRA effectively modulates model computation based on GSD.

03. The approach enables dynamic routing and GSD prediction without sensor metadata.

Abstract

Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight…
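
The abstract says only that CS-HLoRA modulates the LoRA low-rank subspace through a GSD-driven gate. As a rough illustration of that general idea, and not the paper's actual formulation, the sketch below gates each rank channel of a LoRA update with a sigmoid computed from features of log10(GSD). The class name `GatedLoRALinear`, the sinusoidal scale encoding, the per-rank sigmoid gate, and all sizes and initializations are assumptions made for this example.

```python
import math
import random


def scale_features(gsd_m, num_freqs=4):
    """Sinusoidal features of log10(GSD in metres). Assumed encoding, not from the paper."""
    s = math.log10(gsd_m)
    feats = []
    for k in range(num_freqs):
        feats.append(math.sin((2 ** k) * s))
        feats.append(math.cos((2 ** k) * s))
    return feats


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class GatedLoRALinear:
    """Hypothetical linear layer: y = W x + B (g(gsd) * (A x)), with g in (0, 1)^rank."""

    def __init__(self, d_in, d_out, rank, num_freqs=4, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols, std):
            return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

        self.num_freqs = num_freqs
        self.W = rand(d_out, d_in, 0.1)          # frozen base weight
        self.A = rand(rank, d_in, 0.1)           # LoRA down-projection
        # Standard LoRA initializes B to zero; it is nonzero here only so the
        # effect of the gate is visible without any training.
        self.B = rand(d_out, rank, 0.1)          # LoRA up-projection
        self.G = rand(rank, 2 * num_freqs, 1.0)  # gate weights (assumed form)

    def gate(self, gsd_m):
        """One gate value per rank channel, driven only by the GSD."""
        return [sigmoid(z) for z in matvec(self.G, scale_features(gsd_m, self.num_freqs))]

    def forward(self, x, gsd_m):
        base = matvec(self.W, x)
        low = matvec(self.A, x)                  # project into the low-rank subspace
        gated = [g * h for g, h in zip(self.gate(gsd_m), low)]
        delta = matvec(self.B, gated)            # project back to the output space
        return [b + d for b, d in zip(base, delta)]


layer = GatedLoRALinear(d_in=4, d_out=3, rank=2)
x = [1.0, 0.5, -0.5, 2.0]
y_fine = layer.forward(x, gsd_m=0.3)     # e.g. sub-metre aerial imagery
y_coarse = layer.forward(x, gsd_m=30.0)  # e.g. 30 m satellite imagery
```

With the same input `x`, the two calls share the base weights and the LoRA matrices and differ only in the gate, so the GSD alone changes which low-rank directions contribute to the output. In a real setup the GSD would come from sensor metadata or, as the abstract indicates, from a separate estimator at deployment.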
