A Survey on Training-free Open-Vocabulary Semantic Segmentation

Naomi Kombol; Ivan Martinović; Siniša Šegvić

arXiv:2505.22209 · cs.CV · May 29, 2025

TL;DR

This survey reviews training-free open-vocabulary semantic segmentation methods that leverage existing multi-modal models, highlighting recent approaches, limitations, and future research directions in this evolving field.

Contribution

It provides a comprehensive overview of more than 30 recent training-free methods for open-vocabulary segmentation that build on multi-modal models, categorizes the approaches, and discusses open challenges.

Findings

01

Most approaches are based on CLIP or similar models.

02

Leveraging auxiliary visual foundation models enhances segmentation performance.

03

Current methods face limitations in accuracy and generalization.

Abstract

Semantic segmentation is one of the most fundamental tasks in image understanding, with a long history of research and, consequently, a myriad of different approaches. Traditional methods train models from scratch, which requires vast amounts of computational resources and training data. With the move to open-vocabulary semantic segmentation, which asks models to classify beyond learned categories, collecting large quantities of finely annotated data would be prohibitively expensive. Researchers have instead turned to training-free methods that leverage existing models built for tasks where data is more easily acquired. Specifically, this survey covers the history, nuances, idea development, and state of the art in training-free open-vocabulary semantic segmentation that leverages existing multi-modal classification models. We will first give a preliminary on the task…
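
To make the idea of reusing a multi-modal classification model for segmentation concrete, here is a minimal sketch. It assumes each image patch and each candidate class name has already been embedded into a shared space by some CLIP-like model, and it labels every patch with the class whose text embedding is most similar. The function names, the toy embeddings, and the class list are all hypothetical; this is an illustration of the general principle, not the procedure of any particular method covered by the survey.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def segment(patch_embeddings, text_embeddings):
    """Label each patch with the index of the most similar class-name embedding.

    patch_embeddings: grid (list of rows) of vectors, one per image patch.
    text_embeddings: list of vectors, one per candidate class name.
    Both are assumed to come from the same pretrained multi-modal model.
    """
    label_grid = []
    for row in patch_embeddings:
        label_row = []
        for patch in row:
            scores = [cosine(patch, text) for text in text_embeddings]
            label_row.append(scores.index(max(scores)))
        label_grid.append(label_row)
    return label_grid

# Toy 2x2 grid of 3-dimensional "patch embeddings" (made-up values).
patches = [
    [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    [[0.1, 0.9, 0.0], [0.0, 0.2, 0.9]],
]
# Hypothetical class vocabulary with made-up text embeddings.
class_names = ["sky", "grass", "road"]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

labels = segment(patches, text)
print([[class_names[i] for i in row] for row in labels])
# prints [['sky', 'sky'], ['grass', 'road']]
```

Because the vocabulary is just a list of text embeddings, new categories can be added at inference time without retraining, which is what makes the setting open-vocabulary.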

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning