SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts

Jiaqi Liu; Ronghao Fu; Lang Sun; Haoran Liu; Xiao Yang; Weipeng Zhang; Xu Na; Zhuoran Duan; Bo Yang

arXiv:2512.02517·cs.CV·December 3, 2025

TL;DR

SkyMoE is a novel vision-language model using Mixture-of-Experts for improved multi-task, multi-granularity remote sensing interpretation, outperforming existing models on diverse datasets.

Contribution

Introduces SkyMoE, a Mixture-of-Experts model with adaptive routing and a new benchmark for comprehensive geospatial interpretation tasks.

Findings

01 Achieves state-of-the-art results on 21 datasets.

02 Demonstrates superior multi-granularity understanding.

03 Validates effectiveness of expert decoupling strategy.

Abstract

The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation…
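To make the routing idea in the abstract concrete, the following is a minimal sketch of generic top-k Mixture-of-Experts gating, not SkyMoE's actual adaptive router, whose details are not given here. The function names (`route_top_k`, `moe_forward`), the toy experts, and the router scores are all hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights.

    router_logits: one score per expert (hypothetical router output).
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Combine the outputs of the selected experts, weighted by their gates."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Toy experts standing in for specialized sub-networks (illustrative only).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
logits = [0.1, 2.0, 2.0]  # hypothetical router scores for one input
output = moe_forward(3.0, experts, logits, k=2)
```

In this toy setup only the two highest-scoring experts run for a given input, which is the usual way sparse MoE layers keep compute bounded while letting experts specialize. A task- and granularity-aware router, as described in the abstract, would presumably condition these scores on the task and granularity of the input rather than use fixed values.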

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Remote-Sensing Image Classification