Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

Chenhao Wang; Yingrui Ji; Yu Meng; Yao Zhu

arXiv:2605.15942 · cs.CV · May 18, 2026

TL;DR

This paper introduces a decomposed vision-language alignment framework for fine-grained open-vocabulary segmentation. It explicitly models the semantic units of a prompt (a concept and its attributes) and their interactions, which improves generalization to unseen attribute-category combinations.

Contribution

The paper proposes a framework that factorizes each textual prompt into a concept token and multiple attribute tokens, and adds a Feature-Gated Cross-Attention module that fuses attribute-specific gating maps multiplicatively to improve compositional understanding.
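A rough sketch of what multiplicative feature gating might look like is given here. It is not the paper's module: it omits the attention weights and learned projections entirely and assumes, purely for illustration, that each attribute's gate is the sigmoid of a dot product between a pixel feature and that attribute's embedding. The names `gate_features`, `pixels`, and `attributes` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gate_features(pixel_features, attribute_embeddings):
    """Scale each pixel feature by the product of per-attribute gates.

    pixel_features: list of feature vectors, one per pixel.
    attribute_embeddings: list of attribute token vectors (same dimension).
    Returns (gated_features, gates), where gates[i] is the combined gate
    in (0, 1) applied to pixel i.
    """
    gated, gates = [], []
    for feat in pixel_features:
        g = 1.0
        for attr in attribute_embeddings:
            # One gate value per attribute (assumed form). Multiplying them
            # means a pixel must respond to every attribute to keep its features.
            g *= sigmoid(dot(feat, attr))
        gates.append(g)
        gated.append([g * x for x in feat])
    return gated, gates

# Toy 2-D features: pixel 0 aligns with both attributes, pixel 1 with only one.
pixels = [[2.0, 2.0], [2.0, -2.0]]
attributes = [[1.0, 0.0], [0.0, 1.0]]
gated, gates = gate_features(pixels, attributes)
```

In this toy setup the pixel that matches only one attribute receives a much smaller combined gate than the pixel that matches both. An additive fusion would not suppress it as strongly.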

Findings

01. Significantly improves generalization to unseen attribute-category combinations.

02. Effectively enforces compositional semantics through feature gating and log-space similarity aggregation.

03. Seamlessly integrates into existing transformer-based segmentation models.

Abstract

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation…
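The scoring-level idea can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the following is an assumption: each token's similarity is treated as a value in [0, 1], and the compositional score is the sum of the log-similarities, which equals the log of their product. The function name `aggregate_log_space`, the `eps` clamp, and the example scores are hypothetical.

```python
import math

def aggregate_log_space(similarities, eps=1e-6):
    """Sum per-token similarities in log-space (assumed formulation).

    similarities: per-token match scores in [0, 1], one for the concept
    token and one per attribute token.
    """
    # Clamp so that log(0) never occurs; eps is an illustrative choice.
    return sum(math.log(max(s, eps)) for s in similarities)

# Hypothetical scores for one concept token and two attribute tokens.
region_a = [0.7, 0.7, 0.7]    # every token matches moderately well
region_b = [0.99, 0.99, 0.2]  # strong concept match, one attribute fails

score_a = aggregate_log_space(region_a)
score_b = aggregate_log_space(region_b)

# Plain averages, for comparison with the log-space scores.
mean_a = sum(region_a) / len(region_a)
mean_b = sum(region_b) / len(region_b)
```

With these made-up numbers a plain average ranks `region_b` higher, while the log-space sum ranks `region_a` higher, because a single weak attribute match dominates the sum. This is one plausible reading of why log-space aggregation suits compositional matching. Each term in the sum also shows how much one token contributed to the final score.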
