Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation

Boyu Han; Qianqian Xu; Shilong Bao; Zhiyong Yang; Ruochen Cui; Xilin Zhao; Qingming Huang

arXiv:2603.04803 · cs.CV · March 23, 2026

Open Access

TL;DR

This paper proposes a diffusion contrastive reconstruction method (DCR) that integrates contrastive signals into diffusion-based image reconstruction to strengthen CLIP's visual representations, balancing Discriminative Ability (D-Ability) and Detail Perceptual Ability (P-Ability).

Contribution

The paper introduces DCR, a unified framework that combines contrastive learning with diffusion-based reconstruction to improve CLIP's visual representations, addressing CLIP's limitations in class separability and fine-grained detail perception.

Findings

01. DCR improves downstream task performance across benchmarks.

02. Contrastive signals from reconstructed images enhance representation quality.

03. Theoretical analysis confirms joint optimization of D-Ability and P-Ability.

Abstract

The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP's representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that…
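To make the "straightforward design" mentioned in the abstract more concrete, the sketch below adds an InfoNCE-style contrastive term to a standard noise-prediction reconstruction loss. It is an illustration under stated assumptions, not the DCR method itself: the function names, the `weight` and `temperature` parameters, and the choice of InfoNCE are all hypothetical and do not come from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; row i of `anchors` should match row i of `positives`.

    Assumption: a generic contrastive objective, chosen only for illustration.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def denoising_mse(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    count = sum(len(row) for row in true_noise)
    squared = sum(
        (x - y) ** 2
        for pred_row, true_row in zip(predicted_noise, true_noise)
        for x, y in zip(pred_row, true_row)
    )
    return squared / count


def naive_combined_loss(predicted_noise, true_noise, anchors, positives, weight=1.0):
    """Hypothetical combined objective: reconstruction term plus weighted contrastive term."""
    return denoising_mse(predicted_noise, true_noise) + weight * info_nce(anchors, positives)
```

In this toy setup, the contrastive term is small when each embedding is closest to its own positive and large when positives are mismatched, which is the class-separability pressure associated with D-Ability, while the reconstruction term stands in for the detail-preserving pressure associated with P-Ability.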

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis