SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images
Dongchen Si, Di Wang, Erzhong Gao, Xiaolei Qin, Liu Zhao, Jing Zhang, Minqiang Xu, Jianbo Zhan, Jianshe Wang, Lin Liu, Bo Du, Liangpei Zhang

TL;DR
SPEX is a multimodal vision-language model that uses spectral priors encoded as textual attributes to improve pixel-level land cover extraction in spectral remote sensing images, outperforming state-of-the-art methods on five public datasets.
Contribution
The paper introduces SPEX, the first multimodal vision-language model specifically designed for land cover extraction in spectral remote sensing imagery. It is trained on SPIE, a new spectral instruction-following dataset, and relies on multiscale feature aggregation, token context condensation, and multispectral visual pre-training.
Findings
SPEX outperforms state-of-the-art methods on five public datasets.
It effectively extracts land cover categories such as vegetation, buildings, and water.
The model can generate textual explanations for its predictions.
Abstract
Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to…
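The abstract describes encoding spectral priors into textual attributes through classical spectral index computations. The sketch below shows how that idea could look for a single pixel, using two well-known indices, NDVI and NDWI. The function names, the 0.3 threshold, and the wording of the attributes are illustrative assumptions; the excerpt does not say which indices, thresholds, or phrasing SPIE actually uses.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def spectral_attributes(green, red, nir, threshold=0.3):
    """Map one pixel's band reflectances to a short textual attribute.

    The threshold and the attribute wording are assumptions made for
    illustration, not values taken from the SPIE dataset.
    """
    ndvi = normalized_difference(nir, red)    # tends to be high over vegetation
    ndwi = normalized_difference(green, nir)  # tends to be high over open water
    if ndvi > threshold:
        label = "high NDVI, likely vegetation"
    elif ndwi > threshold:
        label = "high NDWI, likely water"
    else:
        label = "low NDVI and NDWI, likely built-up or bare surface"
    return ndvi, ndwi, label


# Hypothetical reflectance values for a vegetated pixel.
ndvi, ndwi, label = spectral_attributes(green=0.08, red=0.05, nir=0.45)
print(f"NDVI={ndvi:.2f}, NDWI={ndwi:.2f}: {label}")
```

A string of this kind could then be placed in an instruction prompt next to the image, which is one plausible way for a language model to receive a spectral cue in a form it can read.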
