Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models

Muhammad Atta ur Rahman; Dooseop Choi; Seung-Ik Lee; KyoungWook Min

arXiv:2501.16769 · cs.CV · July 3, 2025

TL;DR

This paper introduces 'Beyond-Labels', a lightweight fusion module that enhances open-vocabulary semantic segmentation by leveraging pre-trained vision-language models and Fourier embeddings, achieving improved performance with minimal retraining.

Contribution

The study proposes a novel fusion module and positional encoding method that enable efficient adaptation of pre-trained models for open-vocabulary segmentation tasks.

Findings

01. Outperforms existing methods on the PASCAL-5i benchmark

02. Uses minimal additional training data and computation

03. Improves generalization with Fourier positional embeddings

Abstract

Open-vocabulary semantic segmentation aims to classify and outline objects in an image using arbitrary text labels, including labels unseen during training. Self-supervised models, when trained effectively, address numerous visual and linguistic processing problems. This study investigates simple yet efficient methods for adapting pre-trained foundation models to open-vocabulary semantic segmentation. We propose "Beyond-Labels", a lightweight transformer-based fusion module that uses a small amount of image segmentation data to fuse frozen visual representations with language concepts. This strategy lets the model leverage the extensive knowledge of pre-trained models without significant retraining, making the approach data-efficient and scalable. Furthermore, we capture positional information in images using Fourier embeddings, improving…
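
The abstract mentions Fourier embeddings for positional information but does not give their form here. As a rough illustration only, the sketch below builds sine/cosine features of normalized pixel coordinates. The function name `fourier_position_features`, the power-of-two frequency schedule, the `num_bands` parameter, and the [-1, 1] normalization are all assumptions made for this example, not details taken from the paper.

```python
import math


def fourier_position_features(height, width, num_bands=4):
    """Return a height x width grid of Fourier positional feature vectors.

    Hypothetical helper: each pixel gets sin/cos values of its normalized
    (y, x) coordinates at num_bands frequencies, so every vector has
    4 * num_bands entries. The frequency schedule is an assumption.
    """
    def normalize(index, size):
        # Map pixel indices to [-1, 1]; a single-pixel axis maps to 0.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    grid = []
    for row in range(height):
        y = normalize(row, height)
        row_features = []
        for col in range(width):
            x = normalize(col, width)
            features = []
            for band in range(num_bands):
                # Assumed schedule: frequencies pi, 2*pi, 4*pi, ...
                freq = (2.0 ** band) * math.pi
                features.extend([
                    math.sin(freq * y), math.cos(freq * y),
                    math.sin(freq * x), math.cos(freq * x),
                ])
            row_features.append(features)
        grid.append(row_features)
    return grid


grid = fourier_position_features(height=3, width=5, num_bands=4)
print(len(grid), len(grid[0]), len(grid[0][0]))  # prints: 3 5 16
```

In a setup like the one the abstract describes, vectors of this kind would typically be added to or concatenated with the frozen visual features before fusion, so that the fusion module can tell image locations apart. How the paper actually combines them is not stated in this excerpt.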

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications