SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images

Dongchen Si; Di Wang; Erzhong Gao; Xiaolei Qin; Liu Zhao; Jing Zhang; Minqiang Xu; Jianbo Zhan; Jianshe Wang; Lin Liu; Bo Du; Liangpei Zhang

arXiv:2508.05202 · cs.CV · March 10, 2026

TL;DR

SPEX is a multimodal vision-language model that uses spectral priors, encoded as textual attributes, to improve pixel-level land cover extraction in spectral remote sensing images. It outperforms existing state-of-the-art methods on five public datasets.

Contribution

The paper introduces SPEX, the first multimodal vision-language model specifically designed for land cover extraction in spectral remote sensing imagery. It is trained on SPIE, a new spectral instruction-following dataset, and combines multiscale feature aggregation, token context condensation, and multispectral visual pre-training.

Findings

01. SPEX outperforms state-of-the-art methods on five public datasets.

02. It effectively extracts land cover categories such as vegetation, buildings, and water.

03. The model can generate textual explanations for its predictions.

Abstract

Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to…
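
To make the idea of turning spectral index computations into text concrete, here is a minimal sketch in Python. It uses two classical indices, NDVI and NDWI, and maps them to a short textual attribute. The function names, the thresholds (0.3 and 0.0), and the attribute wording are illustrative assumptions and are not taken from SPIE or SPEX; the abstract does not say which indices or rules the dataset actually uses.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndwi(green, nir):
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR)."""
    return normalized_difference(green, nir)


def describe_pixel(green, red, nir, veg_threshold=0.3, water_threshold=0.0):
    """Build a hypothetical textual attribute from band reflectances.

    The thresholds are placeholders chosen for illustration only.
    """
    v = ndvi(nir, red)
    w = ndwi(green, nir)
    if v > veg_threshold:
        label = "vegetation-like"
    elif w > water_threshold:
        label = "water-like"
    else:
        label = "other"
    return f"NDVI={v:.2f}, NDWI={w:.2f}: {label}"


if __name__ == "__main__":
    # Made-up reflectance values, not real sensor data.
    print(describe_pixel(green=0.08, red=0.05, nir=0.45))
    print(describe_pixel(green=0.10, red=0.06, nir=0.02))
```

A string of this kind could, in principle, be placed in an instruction prompt so that a language model receives the spectral cue as text alongside the image.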
