FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation
Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

TL;DR
FGAseg targets open-vocabulary semantic segmentation by refining CLIP's image-level alignment into pixel-level pixel-text alignment with cross-modal attention, and by supplementing category boundary information with pseudo-masks. It is reported to outperform existing methods on open-vocabulary segmentation benchmarks.
Contribution
The paper presents FGAseg, a model that achieves fine-grained pixel-text alignment and category boundary supplementation, addressing the fact that VLMs are pretrained for image-level alignment and do not supply the pixel-level detail that segmentation requires.
Findings
Outperforms existing methods on open-vocabulary segmentation benchmarks
Effectively refines coarse CLIP alignment to pixel-level detail
Enhances boundary detection with pseudo-masks derived from similarity measures
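The last finding can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that a pseudo-mask is formed by assigning each pixel the category whose text embedding has the highest cosine similarity to that pixel's feature. The function names `cosine` and `pseudo_mask`, the nested-list layout, and the toy values are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def pseudo_mask(pixel_feats, text_embeds):
    """Label each pixel with the index of its most similar text embedding.

    pixel_feats: H x W grid of feature vectors (hypothetical layout).
    text_embeds: one embedding per category name (hypothetical values).
    """
    mask = []
    for row in pixel_feats:
        mask_row = []
        for feat in row:
            sims = [cosine(feat, t) for t in text_embeds]
            # argmax over categories
            mask_row.append(max(range(len(sims)), key=sims.__getitem__))
        mask.append(mask_row)
    return mask

# Toy 2 x 2 feature map with two categories (illustrative numbers only).
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
pixel_feats = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
mask = pseudo_mask(pixel_feats, text_embeds)
print(mask)
```

In this toy case the left column is assigned to category 0 and the right column to category 1, so the boundary between the two regions falls between the columns. How FGAseg actually derives and uses its pseudo-masks is described in the paper itself.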
Abstract
Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, VLMs are typically pretrained for image-level vision-text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information, which VLMs alone cannot provide. As a result, information extracted directly from VLMs cannot meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation. The core of FGAseg is a Pixel-Level Alignment module that employs a cross-modal attention mechanism and a text-pixel…
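As a rough illustration of what a cross-modal attention step between pixels and text can look like, the sketch below applies plain scaled dot-product attention with pixel features as queries and text embeddings as both keys and values. This is an assumption made for the example, not the paper's Pixel-Level Alignment module: there are no learned projections, and the names `softmax` and `cross_attention` and all values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(pixels, texts):
    """Scaled dot-product attention from pixels (queries) to texts (keys/values).

    pixels: N feature vectors of dimension d (hypothetical inputs).
    texts:  K text embeddings of dimension d (hypothetical inputs).
    Returns the text-informed pixel features and the attention weights.
    """
    d = len(texts[0])
    outputs, weights = [], []
    for q in pixels:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in texts]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of text embeddings for this pixel.
        outputs.append([sum(w[j] * texts[j][i] for j in range(len(texts)))
                        for i in range(d)])
    return outputs, weights

texts = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[2.0, 0.0], [0.0, 2.0]]
outputs, weights = cross_attention(pixels, texts)
```

Each row of `weights` is a distribution over categories for one pixel, which is one simple way image features can be made to depend on the text side at pixel granularity.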
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
