GRASP: Geospatial pixel Reasoning viA Structured Policy learning

Chengjie Jiang; Yunqi Zhou; Jiafeng Yan; Jing Li; Jiayang Li; Yue Zhou; Hongjie He; Jonathan Li

arXiv:2508.17102·cs.CV·October 29, 2025

TL;DR

GRASP is a reinforcement learning framework for geospatial pixel reasoning. It cuts annotation cost by supervising with bounding boxes and points rather than dense masks, improves out-of-domain generalization, and reports state-of-the-art in-domain results.

Contribution

The paper presents GRASP, a structured policy-learning framework for geospatial segmentation that cascades a multimodal large language model with a pretrained segmentation model and trains it with reinforcement learning and low-cost supervision.

Findings

01

GRASP achieves state-of-the-art in-domain performance.

02

Out-of-domain performance improves by up to 54%.

03

Annotation costs drop because training uses bounding boxes and points instead of dense masks.

Abstract

Geospatial pixel reasoning aims to generate segmentation masks in remote sensing imagery directly from natural-language instructions. Most existing approaches follow a paradigm that fine-tunes multimodal large language models under supervision with dense pixel-level masks as ground truth. While effective within the training data distribution, this design suffers from two main drawbacks: (1) the high cost of large-scale dense mask annotation, and (2) the limited generalization capability of supervised fine-tuning in out-of-domain scenarios. To address these issues, we propose GRASP, a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner. To enhance generalization, we introduce PRIME, a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and…
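
The cascaded design described above can be illustrated with a minimal sketch. The abstract is truncated before it specifies PRIME's rewards or the prompt format, so everything here is an assumption made for illustration: the JSON output format, the names `parse_prompt`, `box_reward`, and `cascade`, and the use of box IoU as a reward are hypothetical and are not the authors' method. The sketch only shows how a language model's structured output (a box and points) could be validated, scored against a cheap box annotation, and passed to a segmentation model.

```python
import json
from dataclasses import dataclass


@dataclass
class Prompt:
    box: tuple      # (x1, y1, x2, y2) in pixel coordinates
    points: list    # [(x, y), ...]


def parse_prompt(text):
    """Parse a hypothetical JSON answer such as
    {"box": [x1, y1, x2, y2], "points": [[x, y], ...]}.
    Returns None when the answer is malformed."""
    try:
        data = json.loads(text)
        box = tuple(float(v) for v in data["box"])
        points = [(float(x), float(y)) for x, y in data["points"]]
    except (ValueError, KeyError, TypeError):
        return None
    if len(box) != 4 or box[0] >= box[2] or box[1] >= box[3]:
        return None
    return Prompt(box, points)


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def box_reward(text, gt_box):
    """Assumed reward: 0 for a malformed answer, otherwise box IoU
    against a ground-truth box (no dense mask required)."""
    prompt = parse_prompt(text)
    return 0.0 if prompt is None else box_iou(prompt.box, gt_box)


def cascade(mllm, segmenter, image, instruction):
    """Cascade: the language model proposes prompts, the segmenter
    turns them into a mask. Both models are passed in as callables."""
    prompt = parse_prompt(mllm(image, instruction))
    return None if prompt is None else segmenter(image, prompt)
```

In this sketch the reward depends only on a box annotation, which is one way the cheaper supervision mentioned in the findings could enter a reinforcement learning loop; the actual reward design may differ.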