TinyRS-R1: Compact Multimodal Language Model for Remote Sensing

Aybora Koksal; A. Aydin Alatan

arXiv:2505.12099·cs.CV·December 1, 2025

TL;DR

TinyRS-R1 is a compact 2B-parameter multimodal language model optimized for remote sensing, achieving comparable or better performance than larger models while significantly reducing memory and latency, and incorporating reasoning capabilities.

Contribution

Introduces TinyRS-R1, the first domain-specific multimodal language model with reasoning-augmented training for remote sensing applications.

Findings

01

TinyRS-R1 matches or exceeds recent 7B-parameter remote sensing models on classification, VQA, visual grounding, and open-ended question answering.

02

Reasoning training improves spatial grounding and scene understanding.

03

TinyRS-R1 requires only one-third of the memory and latency of the 7B-parameter models it is compared against.

Abstract

Remote-sensing applications often run on edge hardware that cannot host today's 7B-parameter multimodal language models. This paper introduces TinyRS, the first 2B-parameter multimodal small language model (MSLM) optimized for remote sensing tasks, and TinyRS-R1, its reasoning-augmented variant. Built upon Qwen2-VL-2B, TinyRS is trained through a four-stage pipeline: pre-training on million satellite images, instruction tuning on visual instruction examples, fine-tuning with Chain-of-Thought (CoT) annotations from the proposed reasoning dataset, and alignment via Group Relative Policy Optimization (GRPO). TinyRS-R1 matches or surpasses the performance of recent 7B-parameter remote sensing models across classification, VQA, visual grounding, and open-ended question answering, while requiring just one-third of the memory and latency. Our analysis shows that CoT reasoning substantially…
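
The abstract names GRPO as the final alignment stage but does not spell out how it scores responses. As a rough illustration of the general technique (not the authors' implementation), the sketch below shows the group-relative advantage that gives GRPO its name: several responses are sampled for one prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of sampled responses.

    Hypothetical helper: each reward is centered on the group mean and
    scaled by the group standard deviation. `eps` (an assumption) avoids
    division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled answers to a single prompt,
# e.g. 1.0 for a correct answer and 0.0 for an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Identical rewards carry no learning signal: every advantage is zero.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses that score above their group's average get a positive advantage and are reinforced, while those below it are discouraged, so no separate value network is needed to provide a baseline.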
