InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

Muyao Yuan; Yuanhong Zhang; Weizhan Zhang; Lan Ma; Yuan Gao; Jiangyong Ying; Yudeng Xin

arXiv:2511.15967 · cs.CV · November 21, 2025

TL;DR

This paper introduces InfoCLIP, an information-theoretic approach that improves open-vocabulary semantic segmentation by transferring and stabilizing vision-language alignment from pretrained CLIP during fine-tuning.

Contribution

We propose a novel mutual information-based framework, InfoCLIP, to transfer and stabilize CLIP's vision-language alignment for better segmentation performance.

Findings

01. Enhanced segmentation accuracy across benchmarks.

02. Superior transfer of semantic relations compared to existing methods.

03. Robustness to overfitting during fine-tuning.

Abstract

Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained…
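To make "labels pixels using arbitrary text" concrete, the sketch below assigns each pixel the class whose text embedding is most similar to that pixel's embedding. It is only an illustration of the generic pixel-text matching idea, not InfoCLIP's training objective or the authors' code. The function names (`cosine`, `label_pixels`), the 2-dimensional toy embeddings, and the class names are all hypothetical; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def label_pixels(pixel_embeddings, text_embeddings, class_names):
    """Assign each pixel the class whose text embedding is most similar.

    pixel_embeddings: H x W grid of embedding vectors (hypothetical toy input).
    text_embeddings:  one embedding vector per class name.
    """
    labels = []
    for row in pixel_embeddings:
        label_row = []
        for pixel in row:
            scores = [cosine(pixel, t) for t in text_embeddings]
            best = max(range(len(scores)), key=scores.__getitem__)
            label_row.append(class_names[best])
        labels.append(label_row)
    return labels

# Toy 2x2 feature map with made-up 2-dimensional embeddings.
pixels = [[[1.0, 0.1], [0.9, 0.2]],
          [[0.1, 1.0], [0.2, 0.8]]]
texts = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for text-encoder outputs
names = ["cat", "grass"]           # arbitrary class names chosen at test time

segmentation = label_pixels(pixels, texts, names)
print(segmentation)
```

Because the class list is just a set of text embeddings, new categories can be added at inference time without retraining. This is the alignment that, according to the abstract, fine-tuning on limited seen categories tends to degrade.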

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling