DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency

Shuyang Jiang; Nan Yu; Yiming Zhang; Zenghui Ding; Zhenyu Wu

arXiv:2605.06592 · cs.CV · May 8, 2026

TL;DR

DINORANKCLIP is a vision-language pretraining framework that combines high-order ranking consistency with distillation from a frozen DINOv3 teacher. It outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under similar compute, with the largest gains on fine-grained and out-of-distribution tasks.

Contribution

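DINORANKCLIP jointly addresses two weaknesses of CLIP: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling loses fine-grained local structure. It does so by injecting a frozen DINOv3 teacher into the contrastive trunk and by moving beyond RANKCLIP's strictly first-order ranking model to a high-order one.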

Findings

01 Outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under similar compute.

02 Largest gains observed in fine-grained and out-of-distribution tasks.

03 Optimal ranking order identified as R* = 3 across benchmarks.

Abstract

Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order.…
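
The two loss families named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the default temperature of 0.07, and the use of plain Python lists are assumptions made for illustration, and the excerpt does not specify how DINORANKCLIP builds its reference rankings or its high-order terms. The sketch only shows why a symmetric InfoNCE loss supervises the matched (diagonal) pairs alone, while a Plackett-Luce likelihood scores a full ordering.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs
    sit on the diagonal. Only the diagonal log-probabilities enter the
    loss, so the ordering among off-diagonal pairs is not supervised.
    The default temperature is an assumed value for illustration.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    image_to_text = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (best item first) under Plackett-Luce.

    At each step the next item is chosen by a softmax over the scores
    of the items that have not been placed yet.
    """
    remaining = list(ranking)
    total = 0.0
    while remaining:
        logits = [scores[k] for k in remaining]
        total += log_softmax(logits)[0]  # log-prob of placing remaining[0] next
        remaining.pop(0)
    return total
```

For example, `plackett_luce_log_likelihood([2.0, 1.0, 0.0], [0, 1, 2])` is larger than the same call with the ranking reversed, because the first ordering agrees with the scores. A ranking-consistency loss in this style would penalize a model whose in-batch similarity scores disagree with a reference ordering, which is information the diagonal-only InfoNCE term never sees.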
