GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Deen Dayal Mohan; Hossein Souri; Vitali Petsiuk; Juhong Min; Gopal Sharma; Luowei Zhou; Suren Kumar

arXiv:2603.24804 · cs.CV · March 27, 2026

Open Access

TL;DR

GoldiCLIP is a balanced-supervision framework for vision-language pretraining that achieves competitive results with significantly less data by combining text-conditioned self-distillation, an encoder-integrated decoder trained with a VQA objective, and uncertainty-based loss weighting.

Contribution

The paper presents a multifaceted training framework that uses uncertainty-based weighting to balance heterogeneous supervision signals, enabling data-efficient vision-language model training with state-of-the-art performance.

Findings

01

Improves MSCOCO retrieval by 2.2 points

02

Improves fine-grained retrieval by 2.0 points

03

Gains 5.9 points on question-based retrieval

Abstract

Until recently, the success of large-scale vision-language models (VLMs) has primarily relied on billion-sample datasets, posing a significant barrier to progress. Recent works have begun to close this gap by improving supervision quality, but each addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our multifaceted training framework combines three key innovations: (1) a text-conditioned self-distillation method to align both text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a Visual Question Answering (VQA) objective that enables the encoder to generalize beyond caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances all heterogeneous losses. Trained on…
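
The abstract does not spell out how the uncertainty-based weighting in (3) is computed. As a rough illustration only, the sketch below uses a common formulation from multi-task learning, in which each loss L_i gets a learnable log-variance s_i and the total is the sum of exp(-s_i) * L_i + s_i. Whether GoldiCLIP uses this exact form is an assumption; the function names and loss values are hypothetical and do not come from the paper.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-objective losses L_i with learnable log-variances s_i.

    Assumed form (not confirmed by the abstract):
        total = sum_i exp(-s_i) * L_i + s_i
    """
    if len(losses) != len(log_vars):
        raise ValueError("losses and log_vars must have the same length")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def log_var_gradients(losses, log_vars):
    """Gradient of the total with respect to each s_i: 1 - exp(-s_i) * L_i."""
    return [1.0 - math.exp(-s) * loss for loss, s in zip(losses, log_vars)]


def fit_log_vars(losses, steps=2000, lr=0.1):
    """Plain gradient descent on s_i with the losses held fixed (toy setting)."""
    log_vars = [0.0] * len(losses)
    for _ in range(steps):
        grads = log_var_gradients(losses, log_vars)
        log_vars = [s - lr * g for s, g in zip(log_vars, grads)]
    return log_vars


# Hypothetical loss values for three objectives (illustrative numbers only).
example_losses = [2.0, 0.5, 1.0]
fitted = fit_log_vars(example_losses)
weights = [math.exp(-s) for s in fitted]  # effective weight on each loss
```

In this toy setting, with the losses held fixed, each effective weight settles near 1 / L_i, so objectives with larger losses are down-weighted and the s_i term keeps the weights from collapsing to zero. In actual training the s_i would presumably be learned jointly with the model parameters rather than fitted against fixed losses.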

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques