A Simple Framework for Open-Vocabulary Zero-Shot Segmentation

Thomas Stegmüller; Tim Lebailly; Nikola Dukic; Behzad Bozorgtabar; Tinne Tuytelaars; Jean-Philippe Thiran

arXiv:2406.16085·cs.CV·October 16, 2025



TL;DR

SimZSS introduces a straightforward approach for open-vocabulary zero-shot segmentation that leverages frozen vision models and linguistic cues, achieving rapid training and state-of-the-art results on multiple benchmarks.

Contribution

The paper presents a simple, effective framework that enhances zero-shot segmentation by combining frozen vision models with linguistic localization, requiring only image-caption data.

Findings

01

Achieves state-of-the-art performance on 7 out of 8 benchmarks.

02

Requires only image-caption pairs for training.

03

Trains in less than 15 minutes on large datasets.

Abstract

Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual…
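As a rough illustration of how text-aligned patch features can yield a segmentation, the sketch below labels each patch with the class whose text embedding is most similar to it. This is a generic nearest-text-embedding scheme under stated assumptions, not the authors' implementation: the function names, the toy 2-D features, and the class names are all hypothetical, and a real system would take patch features from the frozen vision model and class embeddings from the aligned text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def segment_patches(patch_features, class_embeddings):
    """Assign every patch the index of its most similar class embedding.

    patch_features: grid (list of rows) of patch feature vectors.
    class_embeddings: one text embedding per candidate class name.
    """
    num_classes = len(class_embeddings)
    return [
        [
            max(range(num_classes), key=lambda k: cosine(patch, class_embeddings[k]))
            for patch in row
        ]
        for row in patch_features
    ]


# Toy 2x2 grid of 2-D patch features (hypothetical values).
patch_features = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[0.1, 1.0], [0.2, 0.8]],
]
# Hypothetical text embeddings for two class names, e.g. "cat" and "grass".
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]

label_map = segment_patches(patch_features, class_embeddings)
print(label_map)  # [[0, 0], [1, 1]]
```

The argmax rule needs no similarity threshold; whether the paper's inference procedure uses one is not stated in this summary.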

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 5·Confidence 4

Strengths

The experimental performance is good.

Weaknesses

1. The novelty and contribution are limited. First, I prefer simple methods. I believe that a simple method is more valuable for applications and research. However, a simple method requires deeper analysis and insights. Unfortunately, this work only presents a simple method without any insights. The method is designed without motivation or explanation. This paper looks like an experiment report, rather than a research paper. For a good research paper, the authors need to tell new insights, rathe

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The paper is easy to understand and follow.
2. The method is straightforward, easy to implement, and can be readily adapted to various backbones and training datasets.
3. The method can be trained without the supervision of semantic masks, reducing the burden of annotations.
4. The motivation for proposing a concept-level objective is clear and more suitable compared to a contrastive objective in scenarios where concepts encode individual objects that are likely to occur multiple times within

Weaknesses

1. It is unclear how the final segmentation masks are generated during inference. Is there a similarity threshold used to determine the class names to which visual tokens belong? If so, how does the performance vary with different threshold settings?
2. There is a lack of analysis explaining why SimZSS outperforms other zero-shot semantic segmentation methods.

Reviewer 03·Rating 6·Confidence 4

Strengths

- The paper is well-written and easy to follow. The authors provide clear explanations of their methodology. I would like to highlight the quality of the figures and tables in a visually pleasing way that enhances the understanding of the content.
- While the problem of localization in vision-language models is not novel, the proposed approach offers a novel perspective. By freezing the vision backbone and only training the text encoder, the authors leverage pretrained self-supervised models ef

Weaknesses

1. It is surprising that the main paper lacks essential details about the training data and the pretrained models used. Given that the paper is only 9 pages (below the 10-page limit), including this information in the main text is necessary.
2. (a) One major limitation is the use of a predefined concept bank. The authors claim that it does not impact the breadth of the concepts the model can localize (Sec. 4.4). However, in Table 4, removing PascalVOC classes from the concept bank decreases the

Videos

A Simple Framework for Open-Vocabulary Zero-Shot Segmentation· slideslive

Taxonomy

Topics·Nuclear Materials and Properties