RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning

Shiqi Huang; Shuting He; Bihan Wen

arXiv:2601.21634 · cs.CV · January 30, 2026

TL;DR

This paper introduces RSGround-R1, a post-training framework for remote sensing visual grounding. It combines fine-tuning on synthetic reasoning data, reinforcement learning with a positional reward, and spatial consistency techniques to strengthen spatial reasoning and improve localization accuracy.

Contribution

The paper presents RSGround-R1, a reasoning-guided, position-aware post-training framework that improves spatial understanding in Remote Sensing Visual Grounding (RSVG) tasks.

Findings

01 Superior performance on RSVG benchmarks

02 Enhanced spatial reasoning capabilities

03 Robust localization with spatial consistency

Abstract

Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward…
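The abstract is cut off before it defines the positional reward, so its exact form is not given here. As a rough illustration of what "continuous and distance-aware" could mean, the sketch below contrasts a standard IoU score with a hypothetical reward based on the distance between box centers, normalized by the image diagonal. The function names, the `(x1, y1, x2, y2)` box format, and the normalization are all assumptions made for this example, not the paper's formulation.

```python
import math


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def center_distance_reward(pred, target, img_w, img_h):
    """Hypothetical distance-aware reward (an assumption, not the paper's).

    Returns 1 - (center distance / image diagonal), clipped at 0, so the
    value changes smoothly as the predicted box moves toward the target.
    """
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    tcx, tcy = (target[0] + target[2]) / 2, (target[1] + target[3]) / 2
    dist = math.hypot(pcx - tcx, pcy - tcy)
    diag = math.hypot(img_w, img_h)
    return max(0.0, 1.0 - dist / diag)


# Two predictions that both miss the target entirely.
target = (100, 100, 200, 200)
near_miss = (210, 100, 310, 200)
far_miss = (700, 700, 800, 800)

# IoU cannot tell them apart, but the distance-based score can.
iou_near, iou_far = iou(near_miss, target), iou(far_miss, target)
r_near = center_distance_reward(near_miss, target, 1000, 1000)
r_far = center_distance_reward(far_miss, target, 1000, 1000)
```

Both misses have zero IoU, so an IoU-only reward gives the policy no signal about which one is closer. The distance-based score ranks the near miss higher, which is the kind of graded guidance the abstract describes.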

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning