DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency
Shuyang Jiang, Nan Yu, Yiming Zhang, Zenghui Ding, Zhenyu Wu

TL;DR
DINORANKCLIP is a vision-language pretraining framework that combines high-order ranking consistency with distillation from a frozen DINOv3 teacher, with its largest gains on fine-grained and out-of-distribution visual understanding.
Contribution
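It jointly addresses two limitations of CLIP: the symmetric InfoNCE loss, which discards the relative ordering among unmatched in-batch pairs, and global pooling, which loses fine-grained local structure. It does so by injecting a frozen DINOv3 teacher into the contrastive trunk and by replacing RANKCLIP's strictly first-order ranking model with a high-order one.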
Findings
Outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under similar compute.
Largest gains observed in fine-grained and out-of-distribution tasks.
Optimal ranking order identified as R* = 3 across benchmarks.
Abstract
Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order.…
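To make the two loss families named in the abstract concrete, here is a minimal NumPy sketch of a symmetric InfoNCE loss and a list-wise Plackett-Luce negative log-likelihood. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the choice of in-modal text similarities as the reference ordering are all assumptions, and the sketch covers only the first-order case, not the high-order model or the DINOv3 branch.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalised embeddings.

    Only the diagonal (matched) entries are treated as targets, so the
    loss is indifferent to how the unmatched pairs are ordered.
    The temperature value is an assumption, not taken from the paper.
    """
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def plackett_luce_nll(scores, ref_scores):
    """Plackett-Luce negative log-likelihood, averaged over rows.

    For each row, the target ordering is `ref_scores` sorted in
    descending order; `scores` are the model's log-utilities.
    """
    order = np.argsort(-ref_scores, axis=1)
    s = np.take_along_axis(scores, order, axis=1)
    # log-sum-exp over the items still unranked at each position
    m = s.max(axis=1, keepdims=True)
    remaining = np.cumsum(np.exp(s - m)[:, ::-1], axis=1)[:, ::-1]
    log_denominator = np.log(remaining) + m
    return (log_denominator - s).sum(axis=1).mean()


# Hypothetical usage: random unit-norm embeddings for a batch of 8 pairs.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = rng.normal(size=(8, 16))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

contrastive = symmetric_info_nce(img, txt)
# Assumed pairing: cross-modal similarities should follow the ordering
# given by in-modal text-text similarities.
ranking = plackett_luce_nll(img @ txt.T / 0.07, txt @ txt.T)
```

The contrast between the two functions is the point of the sketch: the InfoNCE term reads only the matched diagonal as a target, while the Plackett-Luce term penalises every row whose full ordering of in-batch pairs disagrees with the reference ordering.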
