Efficiently Disentangling CLIP for Multi-Object Perception

Samyak Rawlekar; Yujun Cai; Yiwei Wang; Ming-Hsuan Yang; Narendra Ahuja

arXiv:2502.02977 · cs.CV · September 26, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces DCLIP, a framework that disentangles features in CLIP to improve multi-object perception, reducing inter-class similarity and enhancing performance in recognition and segmentation tasks with fewer parameters.

Contribution

DCLIP is a novel method that learns optimal mutual information levels in CLIP, improving multi-object perception by disentangling features with minimal additional parameters.

Findings

01 Reduces inter-class similarity by 30%.

02 Outperforms SOTA on VOC2007 and COCO-14 with fewer parameters.

03 Improves zero-shot segmentation across six datasets.

Abstract

Vision-language models like CLIP excel at recognizing the single, prominent object in a scene. However, they struggle in complex scenes containing multiple objects. We identify a fundamental reason for this limitation: VLM feature space exhibits excessive mutual feature information (MFI), where the features of one class contain substantial information about other, unrelated classes. This high MFI becomes evident during class-specific queries, as unrelated objects are activated alongside the queried class. To address this limitation, we propose DCLIP, an efficient framework that learns an optimal level of mutual information while adding only minimal learnable parameters to a frozen VLM. DCLIP uses two complementary losses: a novel MFI Loss that regulates class feature similarity to prevent excessive overlap while preserving necessary shared information, and the Asymmetric Loss (ASL) that…
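For readers unfamiliar with the Asymmetric Loss named in the abstract, the following is a minimal sketch of its commonly published per-image form for multi-label classification. It is not taken from the DCLIP code: the function name `asymmetric_loss`, the default values of `gamma_pos`, `gamma_neg`, and `margin`, and the use of plain Python lists are all assumptions made for illustration, and how DCLIP applies the loss is not shown in this excerpt.

```python
import math

def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Asymmetric loss for one image over C classes (illustrative sketch).

    probs:   per-class sigmoid probabilities in [0, 1]
    targets: per-class labels, 1 if the class is present, else 0
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style down-weighting controlled by gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin, so
            # very easy negatives (p <= margin) contribute exactly zero.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total
```

With these assumed defaults, a confidently rejected negative such as `asymmetric_loss([0.03], [0])` contributes nothing, while a hard negative at probability 0.9 is penalized far more than one at 0.2. That asymmetry is the point of the loss when most classes are absent from any given image.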

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

This paper investigates the root cause of high inter-class feature entanglement in vision-language models (VLMs) like CLIP, particularly in multi-object scenarios. The authors propose DCLIP, a framework that freezes the original VLM and introduces lightweight image and text projectors. The core idea is to enforce orthogonality among class text embeddings using a Mutual Feature Information (MFI) Loss on the text side, and to align local visual features with these "disentangled" text features usin
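The idea of pushing class text embeddings toward orthogonality can be sketched as a penalty on the off-diagonal entries of their cosine-similarity matrix. This is only an illustration of that general technique, not the paper's MFI Loss, whose exact form is not given in this excerpt. The names `off_diagonal_penalty` and `target` are hypothetical; `target` stands in for a non-zero similarity level, since the abstract speaks of an "optimal level" of shared information rather than strict orthogonality.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_gram(embeddings):
    """Pairwise cosine similarities between class embeddings."""
    unit = [l2_normalize(e) for e in embeddings]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def off_diagonal_penalty(embeddings, target=0.0):
    """Mean squared gap between off-diagonal similarities and `target`.

    Requires at least two embeddings. target=0.0 corresponds to pushing
    every pair of class embeddings toward orthogonality.
    """
    gram = cosine_gram(embeddings)
    n = len(gram)
    gaps = [(gram[i][j] - target) ** 2
            for i in range(n) for j in range(n) if i != j]
    return sum(gaps) / len(gaps)
```

For example, `off_diagonal_penalty([[1, 0], [0, 1]])` is zero because the two embeddings are already orthogonal, whereas two nearly parallel embeddings such as `[[1, 0.9], [0.9, 1]]` give a value close to one.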

Weaknesses

1. The equivalence jump from Eq. (1) to Eq. (2) rests on a questionable assumption. The paper implies that vector normalization leads to an identity covariance matrix (Σ = I), which allows diagonal terms to be treated as constants. This leap from a unit L2 norm to unit variance and zero covariance is not strictly valid, and the provided "Gaussian + BN" justification is insufficient to bridge this gap.
2. The core mechanism, that text-side orthogonalization improves image-side alignment, lacks theoretical groun

Reviewer 02 · Rating 6 · Confidence 4
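The first point above can be checked numerically. The sketch below is an illustration added here, not part of the review or the paper: it draws random directions, normalizes them to unit L2 norm, and measures the trace of their covariance. Because every vector has squared norm 1, that trace equals 1 minus the squared norm of the mean, so it can never exceed 1, whereas an identity covariance in d dimensions has trace d. The helper names, sample size, and dimension are arbitrary assumptions, and the check says nothing about the "Gaussian + BN" argument itself.

```python
import math
import random

def sample_unit_vectors(n, d, seed=0):
    """Draw n random directions in R^d and scale each to unit L2 norm."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        out.append([x / norm for x in v])
    return out

def covariance_trace(vectors):
    """Sum of per-dimension population variances (trace of the covariance)."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[k] for v in vectors) / n for k in range(d)]
    return sum(sum((v[k] - means[k]) ** 2 for v in vectors) / n
               for k in range(d))

vectors = sample_unit_vectors(n=2000, d=8)
trace = covariance_trace(vectors)  # bounded by 1, far from trace(I) = 8
```

In this setup each dimension carries a variance of roughly 1/d rather than 1, which is consistent with the concern that unit normalization alone does not justify Σ = I.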

Strengths

1. The paper introduces the novel concept of mutual feature information to explain why models like CLIP perform poorly on multi-object perception tasks, which is an interesting and valuable idea.
2. The authors provide a clear explanation of the proposed framework and conduct extensive experiments to support their methodology.

Weaknesses

1. Is the MFI only present in the text modality? The paper points out that feature entanglement exists in CLIP’s representation space, but the proposed MFI loss is applied only to the text branch. How is disentanglement achieved for the image features? The paper states: “To mitigate this, we project z_i and t_i into a new disentangled space using learnable projectors (h_ϕ: h_{ϕ,img} and h_{ϕ,text}), parameterized by weights ϕ. These projectors map z_i and t_i from the original space

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. The proposed method is simple and easy to understand, yet it achieves strong performance without relying on complex or specialized modules.
2. The experimental evaluation is comprehensive, demonstrating that the approach not only improves multi-label image classification but also generalizes effectively to zero-shot semantic segmentation.
3. The paper is well written and provides detailed methodological explanations and implementation settings, which make the approach easy to reproduce.

Weaknesses

1. In terms of methodology, this paper aims to make text features as orthogonal as possible in the feature space to enhance their discriminability. However, this idea is not entirely new. In addition, the proposed approach is not specifically designed for multi-label problems, as it does not take into account the correlations between different labels.
2. For multi-label image classification, the competing methods are mainly based on CoOp or CLIP, but these approaches are designed for missing-lab

Taxonomy

Topics: Speech Recognition and Synthesis · Handwritten Text Recognition Techniques · Music and Audio Processing

Methods: Contrastive Language-Image Pre-training · ALIGN