Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

Daniel Csizmadia; Andrei Codreanu; Victor Sim; Vighnesh Prabhu; Michael Lu; Kevin Zhu; Sean O'Brien; Vasu Sharma

arXiv:2505.21549 · cs.CV · June 17, 2025

Open Access

TL;DR

DCLIP is a fine-tuned CLIP variant, trained on a small curated dataset, that improves image-text retrieval accuracy through cross-modal distillation while retaining most of CLIP's zero-shot classification capability.

Contribution

DCLIP introduces a novel cross-modal transformer distillation framework that enhances retrieval performance with limited training data.

Findings

01. Significantly improves retrieval metrics such as Recall@K and MAP.

02. Retains approximately 94% of CLIP's zero-shot classification performance.

03. Achieves these results with only ~67,500 training samples.
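
For readers unfamiliar with the Recall@K metric named in the findings, here is a minimal sketch of how it is commonly computed. It is not taken from the paper. It assumes a square similarity matrix in which query i has exactly one correct candidate, at index i, whereas real benchmarks often pair one image with several captions. The function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth candidate is in the top k.

    Assumption (not from the paper): similarity[i][j] is the score of
    query i against candidate j, and candidate i is the single correct
    match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three queries against three candidates (illustrative only).
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
recall_1 = recall_at_k(toy_similarity, 1)
recall_3 = recall_at_k(toy_similarity, 3)
```

In the toy matrix, queries 0 and 2 rank their correct candidate first while query 1 ranks it last, so `recall_1` is 2/3 and `recall_3` is 1.0.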

Abstract

We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained…
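
The abstract says only that the student is trained with a hybrid loss combining contrastive learning and cosine similarity objectives. The sketch below shows one plausible way such a loss could be assembled. Everything specific in it is an assumption rather than the authors' formulation: the symmetric InfoNCE form, the weight `alpha`, the `temperature` value, and all function names.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, guarding against zero vectors."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE; row i of `a` is assumed to match row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy_on_diagonal(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def cosine_distill(student, teacher):
    """Mean (1 - cosine similarity) between student and teacher rows."""
    cos = np.sum(l2_normalize(student) * l2_normalize(teacher), axis=-1)
    return float(np.mean(1.0 - cos))


def hybrid_loss(s_img, s_txt, t_img, t_txt, alpha=0.5, temperature=0.07):
    """Hypothetical weighting: alpha * contrastive + (1 - alpha) * cosine."""
    contrastive = info_nce(s_img, s_txt, temperature)
    distill = 0.5 * (cosine_distill(s_img, t_img)
                     + cosine_distill(s_txt, t_txt))
    return alpha * contrastive + (1.0 - alpha) * distill
```

Here `s_img` and `s_txt` stand for the student's image and text embeddings and `t_img` and `t_txt` for the teacher's, each of shape (batch, dim). The contrastive term keeps matching image-text pairs close within the batch, while the cosine term pulls each student embedding toward the corresponding teacher embedding.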

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training · Contrastive Learning