From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
Shuang Liang, Zeqing Wang, Yuxian Li, Xihui Liu, Han Wang

TL;DR
This paper introduces CAFE, a benchmark for evaluating whether segmentation models truly understand concepts by testing their responses to attribute-level counterfactual modifications.
Contribution
The paper presents CAFE, a novel benchmark with counterfactual attribute manipulations to assess concept-faithful segmentation in promptable models.
Findings
Models often produce accurate masks even for misleading prompts, indicating reliance on visually salient cues rather than on the queried concept.
There is a gap between localization accuracy and true concept understanding.
CAFE enables diagnosis of semantic grounding versus shortcut-driven mask retrieval.
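
To make the gap between localization accuracy and concept understanding concrete, here is a minimal sketch of how one might count shortcut behavior. It is not the paper's metric: the function names, the 0.5 IoU threshold, and the assumption that a concept-faithful model should return an empty or non-matching mask for a misleading prompt are all hypothetical choices made for illustration.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as agreeing perfectly.
    return 1.0 if union == 0 else inter / union


def shortcut_rate(cases, threshold=0.5):
    """Fraction of misleading-prompt cases where the original region is still segmented.

    Each case is (mask_predicted_for_misleading_prompt, ground_truth_mask).
    The threshold is an illustrative assumption, not a value from the paper.
    """
    if not cases:
        return 0.0
    hits = sum(1 for pred, truth in cases if iou(pred, truth) >= threshold)
    return hits / len(cases)


# Toy data: one case where the model ignores the prompt change, one where it abstains.
region = [[0, 1, 1], [0, 1, 1]]
empty = [[0, 0, 0], [0, 0, 0]]
cases = [(region, region), (empty, region)]
print(shortcut_rate(cases))
```

Under these assumptions, a high rate means the model keeps retrieving the same mask regardless of whether the prompt still describes the region, which is the shortcut behavior the findings describe.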
Abstract
Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation, a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface…
