Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang

TL;DR
This paper proposes Diffusion Contrastive Reconstruction (DCR), a method that integrates contrastive signals into diffusion-based image reconstruction to strengthen CLIP's visual representations, balancing discriminative ability against fine-grained detail perception.
Contribution
It introduces DCR, a unified framework that combines contrastive learning with diffusion models to improve CLIP's visual representations, addressing its limitations in class separability and fine-grained detail perception.
Findings
DCR improves downstream task performance across benchmarks.
Contrastive signals from reconstructed images enhance representation quality.
Theoretical analysis confirms that DCR jointly optimizes D-Ability (class separability) and P-Ability (fine-grained detail perception).
Abstract
The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that…
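As a rough illustration of the "straightforward design" mentioned in the abstract (a reconstruction objective augmented with a contrastive term), the NumPy sketch below adds an InfoNCE-style loss to a mean-squared noise-prediction loss. It is not the authors' implementation. The function names `info_nce` and `joint_objective`, the weight `lam`, the `temperature` value, and the choice of InfoNCE and an MSE denoising loss are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def info_nce(anchors, positives, temperature=0.07):
    # InfoNCE: row i of `anchors` should match row i of `positives`;
    # every other row in the batch acts as a negative.
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                       # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def joint_objective(noise_pred, noise_true, anchors, positives, lam=0.5):
    # Hypothetical combined loss: a denoising (reconstruction) term plus a
    # weighted contrastive term. `lam` is an assumed trade-off weight.
    recon = np.mean((noise_pred - noise_true) ** 2)
    return recon + lam * info_nce(anchors, positives)
```

In this sketch the reconstruction term stands in for detail perception (P-Ability) and the contrastive term for class separability (D-Ability); how DCR actually forms and weights its contrastive signal is described in the paper itself, not here.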
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
