Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Weiqing Li; Jinyue Guo; Yaqi Wang; Haiyang Xiao; Yuewei Zhang; Guohua Liu; Hao Henry Wang

arXiv:2603.16455 · cs.CV · March 18, 2026

TL;DR

Evo-Retriever introduces an LLM-guided curriculum evolution framework with viewpoint-pathway collaboration to improve multimodal document retrieval, achieving state-of-the-art results on ViDoRe V2 and MMEB datasets.

Contribution

The paper presents Evo-Retriever, a retrieval framework that uses LLM guidance to adapt the training curriculum as the model evolves, and combines it with multi-view image alignment for finer-grained cross-modal matching.

Findings

01. Achieves a state-of-the-art nDCG@5 of 65.2% on ViDoRe V2.

02. Achieves a state-of-the-art nDCG@5 of 77.1% on MMEB.

03. Enhances fine-grained matching through multi-view image alignment.
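
For readers unfamiliar with the metric behind the first two findings, the following is a minimal sketch of nDCG@5 for a single query. It is not taken from the paper or from the benchmarks' evaluation code. It assumes linear gain (some implementations use 2^rel - 1 instead) and assumes the ranked list already contains every relevant document, so the ideal ordering can be derived from the list itself. The function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k positions.
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    # ranked_relevances[i] is the relevance label of the document the
    # retriever placed at rank i (0 = not relevant).
    # Assumption: all relevant documents appear somewhere in this list.
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# One relevant page, retrieved at rank 1 versus rank 2.
perfect = ndcg_at_k([1, 0, 0, 0, 0])
second = ndcg_at_k([0, 1, 0, 0, 0])
```

A benchmark-level score such as those reported above would typically be the mean of this per-query value over all queries, expressed as a percentage.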

Abstract

Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision.…
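
The abstract refers to late-interaction methods built on multi-vector representations. As a rough illustration of that general idea, the sketch below scores a query against a document with the MaxSim rule popularized by ColBERT-style retrievers: each query vector is matched to its most similar document vector and the maxima are summed. This is an assumption about the family of scorers being discussed, not Evo-Retriever's confirmed scoring function, and the toy vectors and function names are hypothetical.

```python
def dot(u, v):
    # Plain dot product between two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    # Late interaction: for every query token vector, take the best-matching
    # document vector (e.g. an image patch embedding), then sum over tokens.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy 2-D embeddings: two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
doc_b = [[1.0, 0.0]]               # covers only the first token

ranking = sorted(
    [("doc_a", maxsim_score(query, doc_a)), ("doc_b", maxsim_score(query, doc_b))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Because every query token is compared against every document vector, this kind of scoring keeps fine-grained token-to-patch matches that a single pooled embedding would average away.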

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques