IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
Connor Shorten, Augustas Skaburskas, Daniel M. Jones, Charles Pierse, Roberto Esposito, John Trengrove, Etienne Dilocker, Bob van Luijt

TL;DR
This paper introduces IRPAPERS, a benchmark for scientific document retrieval and question answering over both page images and OCR text. It compares the effectiveness of the two modalities and explores multimodal hybrid approaches.
Contribution
It presents a new dataset of 3,230 pages from 166 scientific papers, each page paired with an image and an OCR transcription, evaluates image- and text-based retrieval and QA systems on it, and analyzes their complementary strengths and limitations.
Findings
Multimodal hybrid search outperforms single-modality retrieval.
Text-based QA systems generally outperform image-based ones, but combining modalities improves results.
Open-source image embeddings can surpass some proprietary text embeddings in retrieval accuracy.
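One common way to combine a text ranking and an image ranking into a single hybrid result list is reciprocal rank fusion (RRF). The summary above does not say which fusion method the paper uses, so the sketch below is only an illustration under that assumption; the function name `reciprocal_rank_fusion`, the smoothing constant `k = 60`, and the page identifiers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of page IDs into one list.

    Each page scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. The constant k = 60 is a conventional
    default, not a value taken from the paper.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by page ID for determinism.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Hypothetical rankings from a text retriever and an image retriever.
text_ranking = ["a", "b", "c"]
image_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
print(fused)
```

Pages that rank well in both lists rise to the top, which is one plausible reason a hybrid of two modalities can beat either one alone.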
Abstract
AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reached 43%,…
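The Recall@k figures quoted in the abstract can be read as the fraction of questions whose target page appears among the top k retrieved pages. The sketch below shows that computation, assuming each question has exactly one gold page (a plausible reading of "needle-in-the-haystack", but an assumption here); the function name `recall_at_k` and the toy data are hypothetical.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of questions whose gold page is in the top-k results.

    ranked maps a question ID to its retrieved page IDs, best first.
    gold maps a question ID to its single relevant page ID (assumed).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not gold:
        return 0.0
    hits = sum(
        1 for qid, page_id in gold.items()
        if page_id in ranked.get(qid, [])[:k]
    )
    return hits / len(gold)


# Toy example: two questions, one of which is never answered correctly.
ranked = {"q1": ["p3", "p1", "p2"], "q2": ["p5", "p4"]}
gold = {"q1": "p1", "q2": "p9"}
for k in (1, 2, 5):
    print(k, recall_at_k(ranked, gold, k))
```

Under this definition, Recall@k can only stay the same or grow as k increases, which matches the rising Recall@1, Recall@5, and Recall@20 values reported for text retrieval.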
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
