Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

Jiahao Li; Yang Lu; Yachao Zhang; Yong Xie; Fangyong Wang; Yuan Xie; Yanyun Qu

arXiv:2511.16170 · cs.CV · November 21, 2025

TL;DR

This paper investigates CLIP's internal attention mechanisms in open-vocabulary semantic segmentation, identifying distraction phenomena and proposing a training-free refocusing method that improves dense prediction performance.

Contribution

The paper systematically analyzes CLIP's internal mechanisms for dense prediction from an interpretability perspective and introduces RF-CLIP, a training-free attention refocusing approach that improves segmentation accuracy.

Findings

01. Achieves state-of-the-art results on eight benchmarks.

02. Identifies dimension-specific over-activation as a distraction source.

03. Proposes a training-free attention refocusing method.

Abstract

Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we…
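
The abstract is cut off before the method is described, so the following is only a minimal illustrative sketch of the idea it states: flag tokens that are over-activated in a specific feature dimension, then move attention away from them. It is not the RF-CLIP procedure. The function names, the median/MAD outlier test, and the threshold `z_thresh` are all assumptions made for this example.

```python
import numpy as np

def find_overactivated_tokens(feats, z_thresh=6.0):
    """Flag tokens with an extreme value in any single feature dimension.

    feats: (num_tokens, dim) array of patch-token features.
    ASSUMPTION: over-activation is detected with a robust z-score per
    dimension (median and MAD across tokens); the paper may use another test.
    """
    med = np.median(feats, axis=0, keepdims=True)
    mad = np.median(np.abs(feats - med), axis=0, keepdims=True) + 1e-6
    z = np.abs(feats - med) / mad
    return z.max(axis=1) > z_thresh  # (num_tokens,) boolean mask

def redistribute_attention(attn, distract_mask):
    """Zero the attention paid to flagged tokens and renormalize each row.

    attn: (num_tokens, num_tokens) row-stochastic attention map.
    ASSUMPTION: plain masking plus renormalization stands in for the
    paper's refocusing step.
    """
    attn = attn.copy()
    attn[:, distract_mask] = 0.0
    row_sum = attn.sum(axis=1, keepdims=True)
    return attn / np.clip(row_sum, 1e-12, None)

# Toy input: 16 tokens, 8 dimensions, with token 3 spiking in dimension 5.
feats = np.tile(np.linspace(-1.0, 1.0, 16)[:, None], (1, 8))
feats[3, 5] = 50.0
attn = np.full((16, 16), 1.0 / 16)  # uniform attention as a stand-in

mask = find_overactivated_tokens(feats)
refocused = redistribute_attention(attn, mask)
```

In this toy setup only the spiking token is flagged, and the attention mass it previously received is spread over the remaining tokens. A real pipeline would apply such a step to the attention maps of CLIP's vision encoder, which this sketch does not load.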

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning