Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Qingyun Li; Shuran Ma; Junwei Luo; Yi Yu; Yue Zhou; Fengxiang Wang; Xudong Lu; Xiaoxing Wang; Xin He; Yushi Chen; Xue Yang

arXiv:2511.21272·cs.CV·January 12, 2026

Open Access · 2 Models · 1 Dataset

TL;DR

This paper introduces RSCoVLM, a versatile vision-language model for remote sensing multi-task learning, featuring innovative data processing, dynamic resolution strategies, and a Zoom-in Chain mechanism, achieving state-of-the-art results across various tasks.

Contribution

The paper presents RSCoVLM, a flexible VLM baseline for RS MTL with novel data curation, dynamic resolution handling, and a Zoom-in Chain for ultra-high-resolution images, advancing multi-task remote sensing models.

Findings

01 Achieves state-of-the-art performance on multiple RS tasks.
02 Outperforms existing RS vision-language models.
03 Provides open-source tools and datasets for reproducibility.

Abstract

With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, a unified model that excels across multiple tasks through multi-task learning (MTL) is now within reach. Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, their unified text-based interface shows significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. First, we build a data curation engine that covers data acquisition, offline processing and integration, and online loading and weighting. This data engine effectively addresses…
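
The "online loading and weighting" step mentioned in the abstract can be pictured with a small sketch. The excerpt does not describe the paper's actual weighting scheme, so everything below is an assumption made for illustration: the function name `weighted_task_stream`, the task names, the toy samples, and the weights are all hypothetical, and plain weight-proportional sampling stands in for whatever the authors use.

```python
import random
from typing import Dict, Iterator, List, Tuple


def weighted_task_stream(
    task_samples: Dict[str, List[str]],
    task_weights: Dict[str, float],
    num_draws: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (task, sample) pairs, picking the task in proportion to its weight.

    Hypothetical helper: not the RSCoVLM data engine, only an illustration
    of mixing several task datasets at load time.
    """
    rng = random.Random(seed)
    tasks = sorted(task_samples)  # fixed order keeps runs reproducible
    weights = [task_weights[t] for t in tasks]
    for _ in range(num_draws):
        # Pick a task first, then a sample uniformly within that task.
        task = rng.choices(tasks, weights=weights, k=1)[0]
        yield task, rng.choice(task_samples[task])


# Toy data (assumed): three tasks with different sizes and weights.
toy_samples = {
    "captioning": ["cap_0", "cap_1"],
    "grounding": ["gnd_0"],
    "vqa": ["vqa_0", "vqa_1", "vqa_2"],
}
toy_weights = {"captioning": 1.0, "grounding": 2.0, "vqa": 0.0}

draws = list(weighted_task_stream(toy_samples, toy_weights, num_draws=1000))
```

Sampling the task before the sample decouples a task's share of the training stream from the size of its dataset, which is one common reason to weight at load time rather than concatenate datasets offline.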

Code & Models

Models

Datasets

Qingyun/remote-sensing-sft-data
dataset · 4.2k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning