Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation

Jiachen Li; Hongyun Wang; Jinyu Xu; Wenbo Jiang; Yanchun Ma; Yongjian Liu; Qing Xie; Bolong Zheng

arXiv:2603.27993 · cs.CV · March 31, 2026

TL;DR

PPCR is a progressive reasoning framework that combines multimodal large language models with prompt-guided spatial grounding to improve referring image segmentation accuracy.

Contribution

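The paper structures reasoning as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline guided by semantic and spatial prompts, making the grounding of language descriptions to image regions explicit in referring image segmentation.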

Findings

01. PPCR outperforms existing methods on standard benchmarks.

02. The framework effectively bridges semantic understanding and spatial reasoning.

03. Code will be publicly released for reproducibility.

Abstract

Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models…
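The three-stage ordering named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract is truncated before it describes any stage's internals, so the class `ProgressivePipeline`, the type aliases, and the `toy_*` functions are all hypothetical stand-ins. The only thing the sketch takes from the text is the order of the stages and the idea that each stage passes a prompt to the next.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical toy types; the paper's actual representations are not given here.
Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive
Mask = List[List[bool]]


@dataclass
class ProgressivePipeline:
    """Chains three caller-supplied stages; each stage consumes the previous output."""
    understand: Callable[[Image, str], str]  # expression -> semantic prompt
    ground: Callable[[Image, str], Box]      # semantic prompt -> spatial prompt
    segment: Callable[[Image, Box], Mask]    # spatial prompt -> instance mask

    def run(self, image: Image, expression: str) -> Mask:
        semantic_prompt = self.understand(image, expression)
        spatial_prompt = self.ground(image, semantic_prompt)
        return self.segment(image, spatial_prompt)


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toy_understand(image: Image, expression: str) -> str:
    # Placeholder for a multimodal LLM call: keep the last word as the "target".
    return expression.strip().split()[-1].lower()


def toy_ground(image: Image, semantic_prompt: str) -> Box:
    # Placeholder grounding: bounding box around all nonzero pixels.
    coords = [(r, c) for r, row in enumerate(image)
              for c, v in enumerate(row) if v > 0]
    if not coords:
        return (0, 0, 0, 0)
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def toy_segment(image: Image, box: Box) -> Mask:
    # Placeholder segmentation: nonzero pixels that fall inside the box.
    r0, c0, r1, c1 = box
    return [[r0 <= r < r1 and c0 <= c < c1 and v > 0
             for c, v in enumerate(row)]
            for r, row in enumerate(image)]
```

In this sketch the stages are plain callables, so a real multimodal model, a grounding module, or a promptable segmenter could be swapped in for the toy functions without changing `run`. Whether PPCR wires its stages together this way is an assumption.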
