RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

Yi Lu; Jiawang Cao; Yongliang Wu; Bozheng Li; Licheng Tang; Yangguang Ji; Chong Wu; Jay Wu; Wenbo Zhu

arXiv:2506.04277·cs.CV·June 6, 2025

TL;DR

RSVP is a framework that combines multimodal reasoning with visual segmentation, enabling multimodal large language models to produce precise, interpretable segmentation masks through a two-stage process: reasoning-driven localization followed by segmentation refinement.

Contribution

RSVP introduces a unified approach that explicitly models the interaction between multimodal reasoning and segmentation, achieving state-of-the-art results in visual grounding and segmentation tasks.

Findings

01. Surpasses the prior state of the art by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg.

02. Achieves 49.7 mAP on SegInW in the zero-shot setting.

03. Effectively integrates reasoning and segmentation for interpretable visual understanding.
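The gIoU and cIoU figures above are easier to read with the metric definitions in hand. The sketch below assumes the convention commonly used with ReasonSeg: gIoU is the mean of per-image IoUs, and cIoU is the cumulative intersection divided by the cumulative union over the whole dataset. This page does not restate the definitions, so treat that as an assumption. The function names, the nested-list mask format, and the handling of empty masks are illustrative choices, not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed mask format: a 2-D grid of 0/1 values, where 1 marks foreground.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and in the union of two masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou_ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) under the assumed ReasonSeg-style definitions."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal in length")
    per_image_iou = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: when both masks are empty, count the image as IoU 1.0.
        per_image_iou.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image_iou) / len(per_image_iou)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Toy data: a half-correct small object and a perfectly segmented larger one.
example_preds = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
example_targets = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
example_giou, example_ciou = giou_ciou(example_preds, example_targets)
```

In the toy data the per-image IoUs are 0.5 and 1.0, so gIoU is 0.75, while cIoU is 5/6 because the larger object contributes more pixels. That is why the two metrics can move by different amounts for the same model: cIoU leans toward large objects, gIoU weights every image equally.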

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability but lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structured framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), which seamlessly integrates textual and visual cues to produce…
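The two-stage flow described in the abstract can be pictured as a pair of composable functions. The sketch below shows only the control flow: a reasoning stage that returns a region proposal with a rationale, followed by a segmentation stage that turns the proposal into a mask. Every name here (`RegionProposal`, `run_two_stage`, the stub stages), the box-based proposal format, and the nested-list image are hypothetical placeholders. The stubs stand in for the MLLM and the VLSM and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical proposal format: (x0, y0, x1, y1)
Image = List[List[int]]          # placeholder image: a 2-D grid of pixel values
Mask = List[List[int]]           # binary mask with the same shape as the image


@dataclass
class RegionProposal:
    box: Box
    rationale: str  # text explaining why this region was chosen


def run_two_stage(
    image: Image,
    query: str,
    reason: Callable[[Image, str], RegionProposal],
    segment: Callable[[Image, str, RegionProposal], Mask],
) -> Tuple[RegionProposal, Mask]:
    """Stage 1 proposes a region from the query; stage 2 refines it into a mask."""
    proposal = reason(image, query)
    mask = segment(image, query, proposal)
    return proposal, mask


def stub_reason(image: Image, query: str) -> RegionProposal:
    # Stand-in for the MLLM reasoning stage: always proposes the central region.
    h, w = len(image), len(image[0])
    box = (w // 4, h // 4, 3 * w // 4, 3 * h // 4)
    return RegionProposal(box=box, rationale=f"placeholder reasoning for: {query}")


def stub_segment(image: Image, query: str, proposal: RegionProposal) -> Mask:
    # Stand-in for the segmentation stage: fills the proposed box.
    x0, y0, x1, y1 = proposal.box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(len(image[0]))]
        for y in range(len(image))
    ]


demo_image = [[0] * 4 for _ in range(4)]
demo_proposal, demo_mask = run_two_stage(
    demo_image, "the object used for cutting", stub_reason, stub_segment
)
```

Keeping the proposal as an explicit intermediate value is what makes the pipeline inspectable: the rationale and the region can be examined before any mask is produced.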

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection