Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

Chao Li; Tianhong Li; Sai Vidyaranya Nuthalapati; Hong-You Chen; Satya Narayan Shukla; Jianpeng Cheng; Yonghuan Yang; Jun Xiao; Xiangjun Fan; Aashu Singh; Dina Katabi; Shlok Kumar Mishra

arXiv:2603.02667 · cs.CV · May 19, 2026

TL;DR

DREAM is a unified framework that combines contrastive learning and generative modeling for visual understanding and text-to-image generation. A masking schedule, Masking Warmup, makes joint training of the two objectives possible.

Contribution

The paper proposes Masking Warmup, a training schedule that allows a single encoder to optimize both contrastive and generative objectives simultaneously.
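
To make the idea of a masking schedule concrete, here is a minimal sketch of a sampler whose center moves over training while both low and high masking ratios stay reachable at every step. It is not the paper's implementation: the truncated-Gaussian shape, the low-to-high direction of the shift, the `start`, `end`, and `std` values, and the function names are all assumptions made for illustration.

```python
import random


def mask_ratio_center(step, total_steps, start=0.0, end=0.75):
    """Center of the masking distribution at a given training step.

    Assumption: the center moves linearly from `start` to `end`.
    The actual schedule shape and endpoints may differ.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def sample_mask_ratio(step, total_steps, std=0.25, rng=random):
    """Draw one masking ratio in [0, 1] around the current center.

    Uses rejection sampling from a Gaussian (an assumed distribution).
    Because the distribution has width, low and high ratios can both
    be drawn at any step; only their relative frequency changes.
    """
    center = mask_ratio_center(step, total_steps)
    while True:
        ratio = rng.gauss(center, std)
        if 0.0 <= ratio <= 1.0:
            return ratio
```

In a training loop, one would call `sample_mask_ratio(step, total_steps)` once per image (or per batch) and mask that fraction of image tokens before computing both losses.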

Findings

1. DREAM outperforms CLIP and FLUID on multiple benchmarks.

2. Semantically Aligned Decoding improves image scoring and generation efficiency.

3. Unified training enhances both visual understanding and image generation performance.

Abstract

Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs near-complete visible tokens, while masked generative modeling needs heavy corruption. We introduce DREAM, a unified framework that resolves this conflict through Masking Warmup, a schedule that shifts the center of the masking distribution over training, so low and high masking ratios coexist at every step. This co-exposure lets a single jointly-trained encoder serve both objectives. The resulting stable optimization unlocks Semantically Aligned Decoding at inference: the text encoder, trained against visual embeddings at all masking ratios, can score partially generated images and select the best trajectory with as little as 12.5% of the image decoded, improving both FID and…
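
The selection step described in the abstract, scoring partially generated images against the prompt and keeping the best trajectory, can be sketched as follows. This is an illustrative sketch only: cosine similarity as the score, plain Python lists as embeddings, and the names `cosine` and `select_trajectory` are assumptions, not the paper's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_trajectory(text_emb, partial_image_embs):
    """Pick the candidate whose partial-image embedding best matches the text.

    `text_emb` stands in for the text encoder's output and each entry of
    `partial_image_embs` for the embedding of one partially decoded image
    (both hypothetical placeholders here).
    Returns the index of the best candidate and all scores.
    """
    scores = [cosine(text_emb, emb) for emb in partial_image_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In the setting the abstract describes, the candidates would be scored once a small part of each image has been decoded (as little as 12.5%), and only the selected trajectory would be decoded to completion.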

Code & Models

Repositories

chaoli-charlie/dream (GitHub)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning