Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP
Zhongxing Xu, Feilong Tang, Zhe Chen, Yingxue Su, Zhiyi Zhao, Ge, Zhang, Jionglong Su, Zongyuan Ge

TL;DR
This paper introduces a Vision Prototype Learning framework that addresses the modality gap in CLIP-based weakly supervised semantic segmentation, leading to improved alignment and state-of-the-art results.
Contribution
The paper proposes a novel VPL framework that learns class-specific vision prototypes to better align vision and text features, overcoming the modality gap issue.
Findings
Achieves state-of-the-art performance on benchmark datasets.
Introduces a regional semantic contrast module for robust feature learning.
Provides theoretical analysis of the modality gap impact.
Abstract
The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text, by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between text and vision spaces, the text prototypes employed by these methods have not effectively established a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing contrast loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework, by introducing more…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
MethodsContrastive Language-Image Pre-training
