TL;DR
AnchorSeg takes a structured approach to reasoning segmentation: it conditions on language-grounded query banks and explicitly separates semantic reasoning (what to segment) from spatial localization (where to segment), which improves pixel-level segmentation accuracy.
Contribution
It reformulates reasoning segmentation as structured conditional generation with explicit spatial grounding, introducing language-grounded query banks and a new training objective, Token–Mask Cycle Consistency, for better alignment.
Findings
Achieves state-of-the-art results on ReasonSeg, with 67.7% gIoU.
Uses explicit language-grounded query banks to improve reasoning and localization.
Proposes a Token–Mask Cycle Consistency objective to improve alignment during training.
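For readers unfamiliar with the headline metric, the following is a minimal sketch of how gIoU is commonly computed for reasoning segmentation: the mean of per-image intersection-over-union scores. The function names and the empty-mask convention (an image whose prediction and ground truth are both empty scores 1.0) are assumptions made for illustration, not details confirmed by this summary.

```python
import numpy as np


def per_image_iou(pred, target):
    """IoU between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Assumed convention: both masks empty counts as a perfect match.
    if union == 0:
        return 1.0
    return float(inter) / float(union)


def giou(preds, targets):
    """gIoU: average of the per-image IoU scores over a dataset."""
    scores = [per_image_iou(p, t) for p, t in zip(preds, targets)]
    return float(np.mean(scores))


# Two toy 2x2 images: the first overlaps on 1 of 2 pixels, the second matches exactly.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
score = giou(preds, targets)  # (0.5 + 1.0) / 2
```

Because every image contributes equally, small objects weigh as much as large ones in this average.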
Abstract
Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language-grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized…
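To make the query-bank idea concrete, here is a rough sketch of an ordered bank (latent reasoning tokens followed by one anchor token) and of one way an anchor embedding could be turned into a spatial map over image tokens. Everything here is hypothetical: the names `build_query_bank` and `anchor_spatial_prior`, the token order, the dimensions, and the scaled dot-product with a sigmoid are illustrative assumptions, and the sketch does not attempt to reproduce the factorized conditioning that the abstract goes on to describe.

```python
import numpy as np


def build_query_bank(reasoning_tokens, anchor_token):
    """Stack K reasoning embeddings (K, d) and one anchor embedding (d,) into (K + 1, d).

    Assumed order: reasoning tokens first, anchor token last.
    """
    return np.concatenate([reasoning_tokens, anchor_token[None, :]], axis=0)


def anchor_spatial_prior(anchor_token, image_tokens, grid_hw):
    """Illustrative spatial map: similarity between the anchor and each image token.

    image_tokens has shape (H * W, d); the result has shape (H, W) with values in (0, 1).
    """
    h, w = grid_hw
    d = anchor_token.shape[0]
    logits = image_tokens @ anchor_token / np.sqrt(d)  # scaled dot product (assumption)
    prior = 1.0 / (1.0 + np.exp(-logits))              # sigmoid squashes to (0, 1)
    return prior.reshape(h, w)


rng = np.random.default_rng(0)
d, k, h, w = 16, 4, 8, 8                      # hypothetical sizes
reasoning = rng.normal(size=(k, d))           # stand-ins for latent reasoning states
anchor = rng.normal(size=d)                   # stand-in for the anchor hidden state
image_tokens = rng.normal(size=(h * w, d))    # stand-ins for image token features

bank = build_query_bank(reasoning, anchor)
prior = anchor_spatial_prior(anchor, image_tokens, (h, w))
```

The point of the sketch is only the separation of roles: the reasoning rows of `bank` carry "what" information, while the anchor row alone drives the "where" map.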
