NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval
Zhuchenyang Liu, Yao Zhang, Yu Xiao

TL;DR
NanoVDR distills a large vision-language retriever into a small text-only query encoder for visual document retrieval, cutting query latency and compute cost while retaining most of the teacher's retrieval quality.
Contribution
The paper proposes a decoupled, distillation-based method for visual document retrieval: documents are indexed by the large teacher, while queries are encoded by a much smaller student that achieves near-teacher performance at lower inference latency.
Findings
Pointwise cosine alignment outperforms other distillation objectives.
Data augmentation with machine translation improves cross-lingual transfer.
NanoVDR-S-Multi retains 95.1% of teacher quality with 32x fewer parameters.
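The first finding can be illustrated with a minimal sketch of a pointwise cosine alignment loss. It assumes the loss is one minus the cosine similarity between the student's and the teacher's embedding of the same query text, averaged over a batch; the exact formulation in the paper may differ. The function names and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pointwise_cosine_loss(student_embs, teacher_embs):
    """Mean of (1 - cosine similarity) over paired student/teacher embeddings.

    Assumption: each pair holds the two models' embeddings of the same
    query text. No negatives or in-batch comparisons are involved, which
    is what makes the objective "pointwise".
    """
    losses = [1.0 - cosine_similarity(s, t)
              for s, t in zip(student_embs, teacher_embs)]
    return sum(losses) / len(losses)

# Toy batch: the first pair is perfectly aligned, the second is orthogonal.
student = [[1.0, 0.0], [1.0, 0.0]]
teacher = [[2.0, 0.0], [0.0, 3.0]]
loss = pointwise_cosine_loss(student, teacher)
```

Because the loss depends only on direction, the student does not have to reproduce the norm of the teacher's embeddings, only their orientation in the shared space.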
Abstract
Vision-Language Model (VLM) based retrievers have advanced visual document retrieval (VDR) to impressive quality. They require the same multi-billion parameter encoder for both document indexing and query encoding, incurring high latency and GPU dependence even for plain-text queries. We observe that this design is unnecessarily symmetric: documents are visually complex and demand strong visual understanding, whereas queries are just short text strings. NanoVDR exploits this query--document asymmetry by decoupling the two encoding paths: a frozen 2B VLM teacher indexes documents offline, while a distilled text-only student as small as 69M parameters encodes queries at inference. The key design choice is the distillation objective. Through systematic comparison of six objectives across three backbones and 22 ViDoRe benchmark datasets, we find that pointwise cosine alignment on query text…
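The decoupled encoding paths described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the document vectors stand in for embeddings produced offline by the teacher, the query vector stands in for the student's output at inference, and scoring is assumed to be cosine similarity in a shared embedding space. The names build_index and rank_documents are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_index(doc_embeddings):
    """Offline step: store unit-normalized document vectors keyed by id.

    Assumption: these vectors come from the large teacher, computed once.
    """
    return {doc_id: normalize(vec) for doc_id, vec in doc_embeddings.items()}

def rank_documents(query_embedding, index, top_k=3):
    """Online step: score a query vector against the index by cosine similarity.

    Assumption: the query vector comes from the small text-only student.
    """
    q = normalize(query_embedding)
    scores = {doc_id: sum(a * b for a, b in zip(q, vec))
              for doc_id, vec in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy vectors standing in for real model outputs.
index = build_index({
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.0, 1.0, 0.0],
    "page_c": [0.7, 0.7, 0.0],
})
ranking = rank_documents([0.9, 0.1, 0.0], index)
```

Only rank_documents runs at query time, so in this sketch the large model is never on the inference path.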
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
