FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation

Bingyu Li; Da Zhang; Zhiyuan Zhao; Junyu Gao; Xuelong Li

arXiv:2501.00877 · cs.CV · January 6, 2025

TL;DR

FGAseg targets open-vocabulary semantic segmentation. It refines pixel-text alignment with a cross-modal attention mechanism and supplements category boundary information with pseudo-masks, and it reports better performance than existing methods.

Contribution

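The paper presents FGAseg, a model for fine-grained pixel-text alignment and category boundary supplementation. It addresses a limitation of VLMs such as CLIP, which are pretrained for image-level alignment and therefore lack the pixel-level detail that segmentation requires.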

Findings

01. Outperforms existing methods on open-vocabulary segmentation benchmarks

02. Refines coarse CLIP alignment to pixel-level detail

03. Enhances boundary detection with pseudo-masks derived from similarity measures
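
The third finding mentions pseudo-masks derived from similarity measures. As a rough illustration only, the sketch below builds per-class masks from cosine similarity between pixel features and text embeddings. The function name `pseudo_masks_from_similarity`, the array shapes, and the argmax assignment rule are assumptions made for this example; the listing does not specify how FGAseg computes its pseudo-masks.

```python
import numpy as np

def pseudo_masks_from_similarity(pixel_feats, text_embeds, eps=1e-8):
    """Hypothetical helper, not taken from the FGAseg code base.

    pixel_feats: (H, W, D) per-pixel features.
    text_embeds: (C, D) one embedding per class name.
    Returns the (H, W, C) cosine similarity map and (C, H, W) boolean masks.
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    sim = p @ t.T  # (H, W, C)
    # Assumed rule: each pixel goes to its most similar class.
    labels = sim.argmax(axis=-1)  # (H, W)
    masks = np.stack([labels == c for c in range(t.shape[0])])  # (C, H, W)
    return sim, masks

# Toy usage: a 2x2 "image" with 2-dimensional features and two classes.
feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[2.0, 0.1], [0.1, 3.0]]])
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
sim, masks = pseudo_masks_from_similarity(feats, texts)
```

With an argmax rule every pixel lands in exactly one mask, so the masks partition the image; a threshold on `sim` would be an alternative that allows unassigned pixels.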

Abstract

Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between visual and textual information. However, VLMs are typically pretrained for image-level vision-text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information, which VLMs alone cannot provide. As a result, information extracted directly from VLMs cannot meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation. The core of FGAseg is a Pixel-Level Alignment module that employs a cross-modal attention mechanism and a text-pixel…
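
The abstract names a cross-modal attention mechanism between pixels and text. A minimal sketch of generic scaled dot-product cross-attention, with pixel tokens as queries and text tokens as keys and values, might look like the following. It omits learned projections and multiple heads, and the function name `cross_modal_attention` and the residual connection are assumptions for illustration, not the design of the Pixel-Level Alignment module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(pixel_tokens, text_tokens):
    """Hypothetical sketch: pixels attend to text.

    pixel_tokens: (N, D) flattened pixel features, used as queries.
    text_tokens:  (C, D) class text embeddings, used as keys and values.
    Returns text-enriched pixel tokens (N, D) and attention weights (N, C).
    """
    d = pixel_tokens.shape[-1]
    scores = pixel_tokens @ text_tokens.T / np.sqrt(d)  # (N, C)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    attended = weights @ text_tokens                    # (N, D)
    # Assumed residual connection so the original pixel signal is kept.
    return pixel_tokens + attended, weights

# Toy usage with random features: 6 pixels, 3 class prompts, dimension 4.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(6, 4))
texts = rng.normal(size=(3, 4))
enriched, weights = cross_modal_attention(pixels, texts)
```

Each row of `weights` is a distribution over class prompts for one pixel, which is one way attention can carry text information down to pixel granularity.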

Code & Models

Repositories

LiBingyu01/FGA-seg (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training