TL;DR
This paper introduces LHT-CLIP, a training-free method that enhances CLIP's visual discriminability at the layer, head, and token levels, improving open-vocabulary semantic segmentation.
Contribution
LHT-CLIP systematically exploits CLIP's layer, head, and token features to restore visual discriminability, enabling state-of-the-art segmentation performance without training.
Findings
Final layers focus on image-text alignment, reducing visual discriminability.
A subset of attention heads consistently shows strong visual discriminability.
Abnormal tokens have sparse, consistent activation patterns.
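As a rough illustration of the third finding, the sketch below flags tokens whose activations are concentrated in very few channels. It is not the paper's procedure: the Hoyer sparsity measure, the function names `hoyer_sparsity` and `flag_sparse_tokens`, the threshold of 0.5, and the synthetic token matrix are all assumptions made for this example.

```python
import numpy as np


def hoyer_sparsity(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Hoyer sparsity along the last axis: 0 for a flat vector, 1 for a one-hot vector."""
    d = x.shape[-1]
    l1 = np.abs(x).sum(axis=-1)
    l2 = np.sqrt((x ** 2).sum(axis=-1))
    return (np.sqrt(d) - l1 / (l2 + eps)) / (np.sqrt(d) - 1.0)


def flag_sparse_tokens(tokens: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Boolean mask over tokens whose sparsity exceeds `threshold`.

    The threshold is a hypothetical value chosen for this toy data,
    not a setting reported in the paper.
    """
    return hoyer_sparsity(tokens) > threshold


# Toy stand-in for patch-token features: 196 tokens with 64 channels each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))

# Make two tokens "abnormal": near-zero everywhere except one shared channel,
# mimicking a sparse pattern that is consistent across tokens.
tokens[[3, 77]] = 0.01 * rng.normal(size=(2, 64))
tokens[[3, 77], 10] = 50.0

mask = flag_sparse_tokens(tokens)
flagged = np.flatnonzero(mask)
print(flagged)
```

Dense Gaussian tokens score well below the threshold here, while the two single-channel tokens score close to 1, so only those two are flagged. With real CLIP features the threshold would need to be chosen empirically.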
Abstract
Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment at the expense of visual discriminability (e.g., the last 3 layers in ViT-B/16 and the last 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This framework is designed as a set of plug-and-play modules (ATR, SSR, SHE) that can be applied to other existing training-free methods (like SCLIP, ClearCLIP, and ResCLIP).
2. The authors also extend experiments to SigLIP, suggesting transferability of the proposed approaches.
3. The method requires no gradient descent, extra trainable modules, or large-scale retraining. This is a significant advantage over weakly-supervised or adapter-based methods, which require computational and annotati
1. **Hand-crafted heuristics**: The framework feels like a collection of disparate, heavily-tuned heuristics rather than a fundamental and unified solution: ATR identifies tokens by a specific sparsity threshold; SSR requires manually identifying a model-specific range of layers; SHE requires a pre-computed list of "good" heads identified by averaging discriminability scores across multiple datasets and more operations. The pipeline is complex and sensitive to hyperparameters.
2. **Misleading cl
- The proposed techniques are conceptually simple yet provide novel and valuable insights into leveraging CLIP’s internal structures, such as tokens, layers, and attention heads, for dense prediction tasks.
- The paper presents a fresh and in-depth examination of CLIP’s internal mechanisms, revealing that its final layers tend to emphasize semantic alignment over spatial detail, among other insightful observations.
- The manuscript is clearly written and well-organized, with logically structured s
- While the authors claim that the proposed method improves segmentation performance without extensive hyperparameter tuning, it in fact requires tuning four hyperparameters, as indicated in Tables 2, 3, and 4.
- The results highlighted in light gray in Tables 2–4 are described as reflecting the optimal settings; however, these values differ across tables (e.g., 28.0, 28.1, and 28.4 mIoU), which may cause confusion. A more appropriate hyperparameter sensitivity analysis should include results ob
- **In-depth analysis**. This work presents a thorough and convincing quantitative and qualitative analysis for each of the three findings.
- **Coherent design**. The proposed components (ATR, SSR, and SHE) directly correspond to the three key observations, showcasing a well-reasoned and targeted design.
- **Comprehensive experiments**. The proposed method is supported by extensive and detailed experiments that rigorously validate the effectiveness of each module, providing strong evidence for t
- The observation of anomalous tokens and ATR share similarities with prior work, such as CLIPTrase[1] and SC-CLIP[2]. It would further strengthen the paper if the authors could clearly discuss the conceptual differences from and advantages over these existing approaches.
- The experiments are primarily focused on methods that use a pure CLIP. However, many state-of-the-art methods achieve superior performance by integrating other visual foundation models and these methods also make use of the fi
The combined LHT-CLIP module consistently improves the performance of existing training-free methods (like SCLIP, ClearCLIP, and ResCLIP) and achieves better results on eight semantic segmentation benchmarks.
### Major

**1. Limited Novelty**

The core problems addressed in this paper (performance degradation in final layers, discriminability of specific attention heads, and the emergence of "abnormal tokens") have been previously identified and studied in prior work. It appears the authors have combined these known issues and applied relatively simple or existing techniques to address them. Consequently, the paper's novelty seems limited.

1) The observation that final layers are suboptimal for dens
