VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu, Ryan Rossi, Tianyi Zhou, Changyou Chen, and Ruiyi Zhang

arXiv:2508.07493·cs.CV·August 26, 2025

TL;DR

VisR-Bench is a multilingual benchmark for evaluating question-driven multimodal retrieval in long documents. It addresses gaps in existing datasets by covering sixteen languages, three question types (figures, text, and tables), and queries without explicit answers.

Contribution

We introduce VisR-Bench, a new multilingual benchmark with 35K QA pairs across 1.2K documents, enabling detailed evaluation of various retrieval models for long, multimodal documents.

Findings

1. MLLMs outperform other models in retrieval tasks.
2. Models struggle with structured tables and low-resource languages.
3. The benchmark reveals key challenges in multilingual visual retrieval.

Abstract

Most organizational data in this world are stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence from all these documents. However, existing benchmarks focus on English-only document retrieval or only consider multilingual question-answering on a single-page image. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages with three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate various retrieval…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

- The paper explores a significant challenge of multimodal retrieval for multimodal RAG.
- The benchmark includes not only English but also test samples in 15 non-English languages.
- The paper is clear, well presented, and easy to understand.

Weaknesses

The benchmark was constructed using several strong assumptions, which could lead to biases and inaccuracies in the evaluation results.

- When documents are fed into LLMs to derive QA pairs, whether figure-, table-, or text-based, the input documents are assumed to be the oracle. This seems reasonable, but other documents (not the input documents) might also lead to the correct answer.
- In terms of the heuristics that enforce figure-, table-, or text-relat

Reviewer 02·Rating 2·Confidence 2

Strengths

1. Comprehensive model coverage: The authors evaluate a diverse set of retrieval models, including text-only, multimodal encoders, and MLLM-based approaches.
2. Strong clarity and structure: The paper is well written, logically organized, and easy to follow, making its experimental design and contributions accessible.
3. Data curation: The paper notes that all documents underwent human validation to ensure exclusion of harmful content and PII, enhancing dataset safety and reliability.
4. Data

Weaknesses

1. Limited evaluation of reasoning-capable MLLMs. It would strengthen the analysis to include recent reasoning-optimized MLLMs (e.g., OpenAI o3), even on a small, hard subset. These models could provide deeper insights into reasoning gaps and multimodal generalization.
2. Over-reliance on Top-1 accuracy. The use of Top-1 retrieval accuracy as the primary metric may overstate model weaknesses. While the paper reports a Top-1 accuracy of ~75.2%, the Top-5 accuracy reaches 94.1%, suggesting retrie
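
The distinction between Top-1 and Top-5 retrieval accuracy raised in this review can be illustrated with a minimal sketch. It assumes each query has exactly one gold evidence page and that a retriever returns a ranked list of page indices; the function name `top_k_accuracy` and the toy rankings are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_pages: Sequence[Sequence[int]],
    gold_pages: Sequence[int],
    k: int,
) -> float:
    """Fraction of queries whose gold page is among the first k ranked pages."""
    if len(ranked_pages) != len(gold_pages):
        raise ValueError("ranked_pages and gold_pages must have the same length")
    if not gold_pages:
        return 0.0
    hits = sum(
        1
        for ranking, gold in zip(ranked_pages, gold_pages)
        if gold in ranking[:k]
    )
    return hits / len(gold_pages)


# Toy rankings, invented for illustration only (one ranked page list per query).
rankings = [
    [3, 1, 7, 2, 9],  # gold page 3 is ranked first
    [4, 5, 0, 8, 6],  # gold page 5 is ranked second
    [2, 0, 1, 3, 4],  # gold page 4 is ranked fifth
    [7, 6, 5, 4, 3],  # gold page 9 is never retrieved
]
gold = [3, 5, 4, 9]

top1 = top_k_accuracy(rankings, gold, k=1)
top5 = top_k_accuracy(rankings, gold, k=5)
print(f"Top-1: {top1:.2f}, Top-5: {top5:.2f}")
```

In this toy case two of the four queries miss at rank 1 but are recovered within the top 5, which is the kind of gap the reviewer highlights between the two metrics.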

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The proposed benchmark is the first multilingual long-document retrieval benchmark, making it a valuable contribution to the study of multilingual multimodal evaluation.
2. The experiments and analyses are comprehensive, covering different categories of models and clearly demonstrating the necessity of MLLMs for multilingual long-document understanding.

Weaknesses

1. While the benchmark covers multiple languages, it includes only phonographic languages and lacks logographic ones such as Chinese, which limits its linguistic diversity and generalization scope.
2. Given that Gemini is well known for its strong long-context and multilingual capabilities, its absence in the evaluation is notable.
