Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP

Zhongxing Xu; Feilong Tang; Zhe Chen; Yingxue Su; Zhiyi Zhao; Ge Zhang; Jionglong Su; Zongyuan Ge

arXiv:2412.19650 · cs.CV · December 30, 2024

TL;DR

This paper introduces a Vision Prototype Learning framework that addresses the modality gap in CLIP-based weakly supervised semantic segmentation, leading to improved alignment and state-of-the-art results.

Contribution

The paper proposes a novel VPL framework that learns class-specific vision prototypes to better align vision and text features, overcoming the modality gap issue.
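To make the idea of class-specific vision prototypes concrete, here is a minimal sketch, not the paper's implementation. It assumes a prototype is simply the mean of L2-normalized pixel features belonging to a class, and that pixels are assigned to the prototype with the highest cosine similarity. The helper names `class_prototypes` and `assign_pixels` and the synthetic features are hypothetical, chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def class_prototypes(features, labels, num_classes):
    """Hypothetical helper: one prototype per class, taken as the mean of
    the L2-normalized features that carry that class label."""
    feats = l2_normalize(features)
    protos = np.zeros((num_classes, feats.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # classes with no pixels keep a zero prototype
            protos[c] = feats[mask].mean(axis=0)
    return l2_normalize(protos)


def assign_pixels(features, prototypes):
    """Assign each feature to the prototype with highest cosine similarity."""
    sims = l2_normalize(features) @ prototypes.T
    return sims.argmax(axis=1)


# Synthetic stand-in for pixel-level vision features (assumption: 2 classes, 3-D).
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = np.repeat([0, 1], 50)
features = centers[labels] + 0.05 * rng.normal(size=(100, 3))

protos = class_prototypes(features, labels, num_classes=2)
pred = assign_pixels(features, protos)
```

Because the prototypes are built from vision features themselves, they live in the same space as the pixels they are compared against, which is the intuition behind preferring them over text prototypes.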

Findings

01 Achieves state-of-the-art performance on benchmark datasets.

02 Introduces a regional semantic contrast module for robust feature learning.

03 Provides theoretical analysis of the modality gap impact.

Abstract

The application of Contrastive Language-Image Pre-training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross-modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between the text and vision spaces, the text prototypes employed by these methods have not established a close correspondence with pixel-level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing the contrastive loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework, by introducing more…
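As a rough illustration of what a modality gap looks like numerically, the sketch below uses one commonly used measure: the Euclidean distance between the centroids of L2-normalized image and text embeddings. This is an assumption made for illustration, not necessarily the definition used in the paper's theoretical analysis. The function name `modality_gap` and the synthetic embeddings are hypothetical; real CLIP embeddings would come from the model's image and text encoders.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row vector to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def modality_gap(image_emb, text_emb):
    """Hypothetical helper: distance between the centroids of the
    L2-normalized image embeddings and text embeddings."""
    img_centroid = l2_normalize(image_emb).mean(axis=0)
    txt_centroid = l2_normalize(text_emb).mean(axis=0)
    return float(np.linalg.norm(img_centroid - txt_centroid))


# Synthetic embeddings (assumption): both modalities share the same structure
# but are shifted in opposite directions along one axis.
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[0] = 3.0
shared = rng.normal(size=(200, dim))
image_emb = shared + offset
text_emb = shared - offset

gap_separated = modality_gap(image_emb, text_emb)  # shifted modalities
gap_same = modality_gap(image_emb, image_emb)      # identical sets, no gap
```

In this toy setup the two embedding clouds share the same semantics but occupy different regions of the unit sphere, so the centroid distance is large even though the pairwise structure within each modality is identical.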

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training