Phrase-Instance Alignment for Generalized Referring Segmentation

E-Ro Nguyen; Hieu Le; Dimitris Samaras; Michael S. Ryoo

arXiv:2411.15087·cs.CV·March 26, 2026

TL;DR

This paper treats generalized referring segmentation (GRES) as instance-level reasoning: it aligns phrases in the referring expression with predicted object instances, which gives explicit phrase-instance grounding and improves results on gRefCOCO and Ref-ZOM.

Contribution

The paper reformulates GRES as an instance-level reasoning task trained with a Phrase-Object Alignment (POA) loss, with the aim of making predictions more interpretable and more robust.

Findings

01 Achieves 3.22% cIoU improvement on gRefCOCO

02 Attains 12.25% N-acc increase on Ref-ZOM

03 Enables explicit phrase-instance grounding

Abstract

Generalized referring expressions can describe one object, several related objects, or none at all. Existing generalized referring segmentation (GRES) models treat all cases alike, predicting a single binary mask and ignoring how linguistic phrases correspond to distinct visual instances. To address this, we reformulate GRES as an instance-level reasoning problem, where the model first predicts multiple instance-aware object queries conditioned on the referring expression, then aligns each with its most relevant phrase. This alignment is enforced by a Phrase-Object Alignment (POA) loss that builds fine-grained correspondence between linguistic phrases and visual instances. Given these aligned object instance queries and their learned relevance scores, the final segmentation and the no-target case are both inferred through a unified relevance-weighted aggregation mechanism. This…
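
The relevance-weighted aggregation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the aggregation formula, so the function name `aggregate`, the pixel-wise maximum over relevance-weighted mask probabilities, and both thresholds are assumptions chosen only to show how one set of relevance scores could produce both the final mask and the no-target decision.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def aggregate(instance_mask_logits, relevance_logits,
              mask_threshold=0.5, no_target_threshold=0.5):
    """Hypothetical relevance-weighted aggregation (assumed, not from the paper).

    instance_mask_logits: list of Q masks, each an H x W nested list of logits.
    relevance_logits: list of Q floats, one relevance logit per instance query.
    Returns (binary_mask, no_target_flag).
    """
    relevance = [sigmoid(r) for r in relevance_logits]
    height = len(instance_mask_logits[0])
    width = len(instance_mask_logits[0][0])

    # Assumption: the expression has no target when no query is relevant enough.
    if max(relevance) < no_target_threshold:
        return [[0] * width for _ in range(height)], True

    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Assumption: weight each instance's mask probability by its
            # relevance and keep the strongest response at this pixel.
            score = max(rel * sigmoid(m[y][x])
                        for rel, m in zip(relevance, instance_mask_logits))
            row.append(1 if score >= mask_threshold else 0)
        mask.append(row)
    return mask, False


# Two queries over a 1 x 2 image: only the first query is relevant.
masks = [[[10.0, -10.0]], [[-10.0, 10.0]]]
final_mask, no_target = aggregate(masks, [5.0, -5.0])
```

In this toy case the second query covers the right-hand pixel but has low relevance, so that pixel is excluded from the final mask. A weighted sum in place of the maximum would be an equally plausible reading of the abstract.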

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training