Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing

Xu Zhang; Junyao Ge; Yang Zheng; Kaitai Guo; Jimin Liang

arXiv:2512.19302·cs.CV·April 22, 2026

TL;DR

This paper introduces Think2Seg-RS, a decoupled framework for reasoning segmentation in optical remote sensing: a large vision-language model (LVLM) is trained with reinforcement learning to steer a frozen Segment Anything Model (SAM) through structured geometric prompts.

Contribution

The paper proposes a decoupled LVLM-SAM framework that improves reasoning segmentation in remote sensing by having the LVLM translate semantic reasoning into spatially grounded actions.

Findings

01 Achieves state-of-the-art performance on the EarthReason dataset, with 75.60% cIoU.

02 Outperforms leading methods such as RemoteReasoner and SegEarth-R1.

03 Reveals fundamental differences between semantic grounding and instance segmentation tasks.
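
For readers unfamiliar with the metric in the first finding: in referring and reasoning segmentation benchmarks, cIoU usually denotes cumulative IoU, that is, the total intersection pixels divided by the total union pixels over the whole evaluation set, rather than an average of per-image IoUs. The sketch below illustrates that conventional definition only. This summary does not spell out the paper's exact evaluation protocol, and the function names and the nested-list mask representation are assumptions made for the example.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 (illustrative representation)


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels for one prediction/target pair."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def cumulative_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative IoU: summed intersections over summed unions across a dataset."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Guard against an empty dataset or all-empty masks.
    return total_inter / total_union if total_union else 0.0


# Two toy 2x2 examples (hypothetical data, not from EarthReason).
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
score = cumulative_iou(preds, targets)  # (1 + 1) / (2 + 1)
```

Because the pixel counts are pooled before dividing, large objects weigh more heavily in cIoU than small ones, which is worth keeping in mind when comparing it with per-image averages.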

Abstract

Large Vision-Language Models (LVLMs) hold great promise for advancing optical remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only Group Relative Policy Optimization (GRPO) reinforcement learning objective driven strictly by final mask IoU, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Notably, Think2Seg-RS outperforms leading approaches such as RemoteReasoner and SegEarth-R1 on the…
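
To make the "mask-only" reward concrete, the sketch below shows how a group of sampled outputs could be scored purely by final mask IoU and turned into group-relative advantages, following the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation). It is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the masks stand in for what a frozen SAM would return for each sampled set of geometric prompts, and the paper may use a different normalization variant or extra terms.

```python
import statistics
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 (illustrative representation)


def mask_iou(pred: Mask, target: Mask) -> float:
    """IoU between two binary masks of the same shape."""
    inter = sum(1 for pr, tr in zip(pred, target) for p, t in zip(pr, tr) if p and t)
    union = sum(1 for pr, tr in zip(pred, target) for p, t in zip(pr, tr) if p or t)
    return inter / union if union else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: masks a frozen segmenter might return for three
# different prompt sets sampled from the LVLM for the same query.
target = [[1, 1], [0, 0]]
sampled_masks = [
    [[1, 1], [0, 0]],  # matches the target exactly
    [[1, 0], [0, 0]],  # covers half of the target
    [[0, 0], [1, 1]],  # misses the target entirely
]

# Mask-only reward: nothing but the final IoU is scored.
rewards = [mask_iou(m, target) for m in sampled_masks]
advantages = group_relative_advantages(rewards)
```

In this setup the segmenter receives no gradient. Only the prompter's sampled outputs are reinforced or penalized, depending on whether their masks score above or below the group average.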
