RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought
Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, Wenbo Zhu

TL;DR
RSVP is a framework that combines multimodal reasoning with visual segmentation, enabling multimodal large language models to generate precise, interpretable segmentation masks through a two-stage process: reasoning-driven localization followed by segmentation refinement.
Contribution
RSVP introduces a unified approach that explicitly models the interaction between multimodal reasoning and segmentation, achieving state-of-the-art results in visual grounding and segmentation tasks.
Findings
Surpasses prior state-of-the-art methods by up to 6.5 gIoU and 9.2 cIoU on ReasonSeg.
Achieves 49.7 mAP on SegInW in the zero-shot setting.
Effectively integrates reasoning and segmentation for interpretable visual understanding.
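The findings above report both gIoU and cIoU. As a minimal sketch of how the two differ, assuming the definitions commonly used for ReasonSeg (gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union): the function names, the list-of-rows mask format, and the handling of empty masks below are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 (assumed format)


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and union pixels for one image."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou_ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) over a set of predicted and ground-truth masks."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")

    per_image_iou: List[float] = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        # Assumption: an image where both masks are empty counts as IoU 1.0.
        per_image_iou.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union

    giou = sum(per_image_iou) / len(per_image_iou)  # mean of per-image IoUs
    ciou = total_inter / total_union if total_union else 1.0  # pooled over all pixels
    return giou, ciou


# Two toy 2x2 images: the first has IoU 1/2, the second has IoU 4/4.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
giou, ciou = giou_ciou(preds, gts)
print(giou, ciou)
```

In this toy case gIoU is 0.75 while cIoU is 5/6, because cIoU pools pixels across images and so weights large objects more heavily than gIoU does.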
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability but lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structured framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), seamlessly integrating textual and visual cues to produce…
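The two-stage structure described in the abstract (a reasoning stage that proposes a region, then a segmentation stage that refines it into a mask) can be sketched as a simple composition of two callables. This is only an illustration of the control flow: `two_stage_segment`, `toy_reason`, and `toy_refine` are hypothetical names, and the toy stand-ins use brightness heuristics in place of the MLLM and the VLSM, which the abstract does not specify in detail.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale image as rows of floats (assumed format)
Mask = List[List[int]]           # binary mask as rows of 0/1
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


def two_stage_segment(
    image: Image,
    query: str,
    reason: Callable[[Image, str], Box],
    refine: Callable[[Image, Box, str], Mask],
) -> Tuple[Box, Mask]:
    """Stage 1 proposes a region for the query; stage 2 refines it into a mask."""
    box = reason(image, query)        # reasoning-driven localization
    mask = refine(image, box, query)  # segmentation refinement
    return box, mask


def toy_reason(image: Image, query: str) -> Box:
    # Hypothetical stand-in for the reasoning stage: ignores the query and
    # proposes a 3x3 window around the brightest pixel, clipped to the image.
    _, r, c = max((v, r, c) for r, row in enumerate(image) for c, v in enumerate(row))
    height, width = len(image), len(image[0])
    return max(0, r - 1), max(0, c - 1), min(height, r + 2), min(width, c + 2)


def toy_refine(image: Image, box: Box, query: str) -> Mask:
    # Hypothetical stand-in for the segmentation stage: keeps pixels inside
    # the proposed box whose value exceeds a fixed threshold of 0.5.
    r0, c0, r1, c1 = box
    return [
        [1 if (r0 <= r < r1 and c0 <= c < c1 and v > 0.5) else 0 for c, v in enumerate(row)]
        for r, row in enumerate(image)
    ]


image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box, mask = two_stage_segment(image, "the bright object", toy_reason, toy_refine)
print(box)
for row in mask:
    print(row)
```

Passing the two stages in as callables keeps the intermediate region proposal visible to the caller, which mirrors the interpretability the abstract attributes to the reasoning stage.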
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection
