IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering

Connor Shorten; Augustas Skaburskas; Daniel M. Jones; Charles Pierse; Roberto Esposito; John Trengrove; Etienne Dilocker; Bob van Luijt

arXiv:2602.17687·cs.IR·February 23, 2026

TL;DR

This paper introduces IRPAPERS, a benchmark dataset for scientific document retrieval and question answering using both visual and text modalities, comparing their effectiveness and exploring multimodal hybrid approaches.

Contribution

The paper presents a new dataset that pairs each page image with its OCR transcription, evaluates multimodal retrieval and QA systems, and analyzes the complementary strengths and limitations of the two modalities.

Findings

01. Multimodal hybrid search outperforms single-modality retrieval.

02. Text-based QA generally outperforms image-based QA, but combining modalities improves results.

03. Open-source image embeddings can surpass some proprietary text embeddings in retrieval accuracy.
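
As an illustration of what "multimodal hybrid search" can mean in practice, the sketch below fuses a text-based ranking and an image-based ranking of the same pages. This summary does not say how the paper combines the two rankings. Reciprocal rank fusion (RRF) is assumed here only because it is a widely used option. The function name, the constant k = 60, and the page IDs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of page IDs with reciprocal rank fusion.

    Assumption: RRF stands in for whatever fusion the paper actually uses.
    Each page scores 1 / (k + rank) in every list where it appears.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, page_id in enumerate(ranked, start=1):
            scores[page_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by page ID for determinism.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Toy rankings with made-up page IDs (not taken from the benchmark).
text_ranking = ["p2", "p1", "p3"]
image_ranking = ["p1", "p4", "p2"]
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
print(fused)
```

Pages that rank well in both lists rise to the top of the fused list, which is one way the two modalities could complement each other.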

Abstract

AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieves 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%,…
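
The Recall@k figures quoted in the abstract can be read as the fraction of questions whose target page appears among the top k retrieved pages. The sketch below is a minimal illustration under that reading. It assumes each question has exactly one relevant page, which is an interpretation of "needle-in-the-haystack" rather than something stated here. The function name, question IDs, and page IDs are hypothetical.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of questions whose gold page is within the top-k results.

    Assumption: one gold page per question.
    """
    if not rankings:
        return 0.0
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)


# Toy retrieval output: question ID -> ranked page IDs (made-up values).
rankings = {
    "q1": ["p3", "p1", "p7"],
    "q2": ["p2", "p9", "p4"],
    "q3": ["p8", "p5", "p6"],
    "q4": ["p1", "p2", "p3"],
}
# Gold page per question; q3's gold page is never retrieved.
gold = {"q1": "p3", "q2": "p4", "q3": "p0", "q4": "p2"}

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(rankings, gold, k):.2f}")
```

Recall@k can only stay the same or grow as k increases, which matches the pattern of the reported numbers rising from Recall@1 to Recall@20.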

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling