Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing
Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang

TL;DR
This paper introduces Think2Seg-RS, a decoupled framework for reasoning segmentation in optical remote sensing: a large vision-language model is trained with reinforcement learning to steer a frozen Segment Anything Model (SAM) through structured geometric prompts.
Contribution
It proposes a novel decoupled LVLM-SAM framework that improves reasoning segmentation in remote sensing by translating semantic reasoning into spatially grounded actions.
Findings
Achieves state-of-the-art performance on the EarthReason dataset with 75.60% cIoU.
Outperforms leading methods like RemoteReasoner and SegEarth-R1.
Reveals fundamental differences between semantic grounding and instance segmentation tasks.
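The cIoU figure reported above is, in the referring and reasoning segmentation literature, usually the cumulative IoU: the intersection summed over all samples divided by the union summed over all samples, so large objects weigh more than small ones. The following is a minimal sketch under that assumption; the function name `cumulative_iou` and the nested-list mask format are hypothetical, and the paper's exact evaluation protocol is not stated in this summary.

```python
def cumulative_iou(preds, targets):
    """Cumulative IoU: summed intersection over summed union across all mask pairs.

    preds, targets: lists of binary masks, each mask a list of rows of 0/1 values.
    (Illustrative format only; not taken from the paper.)
    """
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                total_inter += int(bool(p) and bool(t))
                total_union += int(bool(p) or bool(t))
    # Guard against an empty union (all masks empty).
    return total_inter / total_union if total_union else 0.0


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print(cumulative_iou(preds, targets))
```

Because the sums are taken before dividing, this differs from averaging per-image IoU, which would weight every image equally regardless of object size.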
Abstract
Large Vision-Language Models (LVLMs) hold great promise for advancing optical remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only Group Relative Policy Optimization (GRPO) reinforcement learning objective driven strictly by final mask IoU, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Notably, Think2Seg-RS outperforms leading approaches such as RemoteReasoner and SegEarth-R1 on the…
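To make the mask-only reward concrete, the sketch below shows how a group of sampled prompts could be scored by mask IoU and turned into group-relative advantages, following the generic GRPO recipe of normalizing each reward by the group's mean and standard deviation. It is an illustration under stated assumptions, not the authors' implementation: `mask_iou` and `group_relative_advantages` are hypothetical names, the frozen SAM call is omitted (the masks are assumed to be already produced), and the use of the population standard deviation is a choice made here.

```python
from statistics import mean, pstdev


def mask_iou(pred, target):
    """IoU between two binary masks given as lists of rows of 0/1 values."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    return inter / union if union else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = [[1, 1], [0, 0]]
    # Masks that a frozen segmenter might return for three sampled prompts
    # (hypothetical values for illustration).
    sampled_masks = [
        [[1, 1], [0, 0]],  # exact match
        [[1, 0], [0, 0]],  # half of the object
        [[0, 0], [1, 1]],  # wrong region
    ]
    rewards = [mask_iou(m, target) for m in sampled_masks]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this setup the only learning signal is the final mask quality: prompts whose masks beat the group average receive positive advantages and the rest receive negative ones, with no separate supervision on the reasoning text or on the prompt coordinates.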
