GRASP: Geospatial pixel Reasoning viA Structured Policy learning
Chengjie Jiang, Yunqi Zhou, Jiafeng Yan, Jing Li, Jiayang Li, Yue Zhou, Hongjie He, Jonathan Li

TL;DR
GRASP is a reinforcement learning-based framework for geospatial pixel reasoning that lowers annotation cost and improves out-of-domain generalization while achieving state-of-the-art results.
Contribution
The paper presents GRASP, a structured policy-learning framework that cascades a multimodal large language model with a pretrained segmentation model and trains it with reinforcement learning under low-cost supervision for geospatial segmentation.
Findings
Achieves state-of-the-art in-domain performance.
Up to 54% improvement in out-of-domain scenarios.
Reduces annotation cost by supervising with bounding boxes and points rather than dense masks.
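To make the idea of supervising with bounding boxes and points more concrete, here is a minimal sketch of how a scalar reward could be computed from such sparse annotations. This is an illustrative assumption, not the paper's reward: the function names (`box_iou`, `point_hit_rate`, `sparse_reward`), the equal weighting, and the choice to score points by whether they fall inside the reference box are all hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]              # (x, y)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def point_hit_rate(points: Sequence[Point], box: Box) -> float:
    """Fraction of points that fall inside the box (0.0 if no points)."""
    if not points:
        return 0.0
    hits = sum(
        1 for x, y in points
        if box[0] <= x <= box[2] and box[1] <= y <= box[3]
    )
    return hits / len(points)


def sparse_reward(
    pred_box: Box,
    pred_points: Sequence[Point],
    ref_box: Box,
    box_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Hypothetical reward: weighted sum of box IoU and point hit rate."""
    return (
        box_weight * box_iou(pred_box, ref_box)
        + (1.0 - box_weight) * point_hit_rate(pred_points, ref_box)
    )
```

A reward of this kind needs only a reference box per target, which is far cheaper to annotate than a dense mask. The actual reward design used to train GRASP may differ.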
Abstract
Geospatial pixel reasoning aims to generate segmentation masks in remote sensing imagery directly from natural-language instructions. Most existing approaches follow a paradigm that fine-tunes multimodal large language models under supervision with dense pixel-level masks as ground truth. While effective within the training data distribution, this design suffers from two main drawbacks: (1) the high cost of large-scale dense mask annotation, and (2) the limited generalization capability of supervised fine-tuning in out-of-domain scenarios. To address these issues, we propose GRASP, a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner. To enhance generalization, we introduce PRIME, a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and…
