Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

Aadarsh Sahoo; Georgia Gkioxari

arXiv:2602.13195·cs.CV·February 16, 2026

Open Access

TL;DR

This paper introduces a new framework for conversational image segmentation that incorporates abstract concepts, physical reasoning, and intent, supported by a novel benchmark and a scalable data generation method.

Contribution

It presents three components for conversational image segmentation: ConverSeg (a benchmark), ConverSeg-Net (a segmentation model), and an AI-powered data engine for scalable supervision.

Findings

01. ConverSeg-Net outperforms existing models on the ConverSeg benchmark.

02. Current language-guided segmentation models are inadequate for complex conversational tasks.

03. The data engine enables scalable, supervised training without human annotation.

Abstract

Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)