Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria

TL;DR
This paper introduces a new pixel-level reasoning segmentation task based on multi-turn conversations, along with a dataset and a framework that outperform existing methods in fine-grained, interactive segmentation.
Contribution
The work presents a novel multi-turn conversational segmentation task, a new dataset PRIST, and the MIRAS framework that integrates pixel-level reasoning with multi-turn dialogue understanding.
Findings
MIRAS outperforms baseline methods in segmentation accuracy.
The PRIST dataset contains 24k utterances from 8.3k multi-turn conversational scenarios.
The MIRAS framework generates pixel-grounded explanations aligned with user intent.
Abstract
Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level or comprehend dynamic user intent that changes over the course of an interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, which tracks evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework that integrates pixel-level segmentation with robust multi-turn…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques · Visual Attention and Saliency Detection
Methods: Focus
