Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

Jingyun Wang; Cilin Yan; Guoliang Kang

arXiv:2502.06818·cs.LG·May 12, 2026

TL;DR

This paper introduces GCLIP, a method to enhance global knowledge extraction in CLIP for training-free open-vocabulary semantic segmentation, improving dense prediction performance.

Contribution

GCLIP reshapes CLIP's attention and value embeddings to better utilize global context without sacrificing local detail, advancing TF-OVSS capabilities.

Findings

01. Outperforms previous state-of-the-art on five benchmarks.

02. Effectively integrates global context into dense prediction.

03. Enhances CLIP's ability for open-vocabulary segmentation without training.

Abstract

Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP to the dense prediction task. Previous TF-OVSS works sacrifice globality to enhance the locality of CLIP features by making each patch mainly attend to itself or its neighboring patches within a narrow local window. With these modifications, the ability of CLIP to aggregate global context information is largely weakened. In contrast, in this paper, we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize beneficial global knowledge of CLIP for TF-OVSS. As the representation of each patch is finally determined by the attention weights and the Value embeddings, we propose to reshape the last-block…
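The trade-off the abstract describes can be illustrated with a small sketch. This is not GCLIP itself (the abstract is truncated before the method is stated); it only shows, for a single attention head, how each patch output is the product of attention weights and Value embeddings, and how restricting attention to a narrow local window removes contributions from distant patches. The grid size, feature dimension, window radius, and the helper names `local_window_mask` and `patch_outputs` are all assumptions chosen for illustration, and the random arrays stand in for real CLIP features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_window_mask(h, w, radius):
    # Hypothetical helper: True where patch j lies inside the square
    # window of the given radius centred on patch i (h*w x h*w matrix).
    ys, xs = np.divmod(np.arange(h * w), w)
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    return (dy <= radius) & (dx <= radius)

def patch_outputs(q, k, v, mask=None):
    # Single-head attention: each output row is an attention-weighted
    # average of the rows of v (the Value embeddings).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Random stand-ins for last-block patch features (assumed sizes).
rng = np.random.default_rng(0)
h, w, d = 4, 4, 8
q, k, v = (rng.standard_normal((h * w, d)) for _ in range(3))

# Global attention: every patch aggregates from every other patch.
global_out, global_w = patch_outputs(q, k, v)

# Local-window attention: each patch sees only itself and its neighbours.
local_out, local_w = patch_outputs(
    q, k, v, mask=local_window_mask(h, w, radius=1)
)

print("patches seen by corner patch (global):", np.count_nonzero(global_w[0]))
print("patches seen by corner patch (local): ", np.count_nonzero(local_w[0]))
```

In the local variant, the weights outside the window are exactly zero, so no information from distant patches reaches the output. That is the loss of global context the abstract attributes to earlier TF-OVSS modifications.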
