MIRACL-VISION: A Large, multilingual, visual document retrieval benchmark
Radek Osmulski, Gabriel de Souza P. Moreira, Ronay Ak, Mengyao Xu, Benedikt Schifferer, Even Oldridge

TL;DR
MIRACL-VISION is a comprehensive multilingual benchmark designed to evaluate visual document retrieval models across 18 languages, addressing limitations of existing benchmarks and highlighting the performance gap between visual and text-based retrieval methods.
Contribution
It introduces MIRACL-VISION, a new multilingual visual document retrieval benchmark based on the MIRACL dataset, with a novel method to filter easy negatives for more challenging evaluation.
Findings
Visual models perform up to 59.7% worse than text models in multilingual retrieval.
Even in English, visual models lag behind text-based models by 12.1%.
MIRACL-VISION provides a challenging benchmark for developing robust visual retrieval models.
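The percentages above compare two retrieval scores, but this summary states neither the metric nor whether the gap is relative or absolute. As a minimal sketch, assuming the gap is measured relative to the text model's score, the arithmetic could look as follows. The function name, language keys, and all scores are hypothetical placeholders, not values from the paper.

```python
def relative_gap(text_score: float, visual_score: float) -> float:
    """Return the percentage by which visual_score falls short of text_score.

    Assumption: the gap is relative to the text model's score; the summary
    does not confirm this definition.
    """
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return 100.0 * (text_score - visual_score) / text_score


# Hypothetical (text, visual) scores per language, for illustration only.
scores = {
    "lang_a": (0.80, 0.40),
    "lang_b": (0.50, 0.45),
}

# Relative gap for each language, then the language with the largest gap.
gaps = {lang: relative_gap(t, v) for lang, (t, v) in scores.items()}
worst_lang = max(gaps, key=gaps.get)
```

Under this definition, an "up to" figure corresponds to the largest per-language gap, which is what `worst_lang` picks out.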
Abstract
Document retrieval is an important task for search and Retrieval-Augmented Generation (RAG) applications. Large Language Models (LLMs) have contributed to improving the accuracy of text-based document retrieval. However, documents with complex layouts and visual elements like tables, charts and infographics are not perfectly represented in textual format. Recently, image-based document retrieval pipelines have become popular, which use visual large language models (VLMs) to retrieve relevant page images given a query. Current evaluation benchmarks on visual document retrieval are limited, as they focus primarily on the English language, rely on synthetically generated questions and offer a small corpus size. Therefore, we introduce MIRACL-VISION, a multilingual visual document retrieval evaluation benchmark. MIRACL-VISION covers 18 languages, and is an extension of the MIRACL dataset, a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
