Document-as-Image Representations Fall Short for Scientific Retrieval
Ghazal Khalighinejad, Raghuveer Thirukovalluru, Alexander H. Oh, Bhuwan Dhingra

TL;DR
Using a new LaTeX-based benchmark, this paper shows that text-based and multimodal (interleaved text+image) representations outperform document-as-image embeddings for scientific retrieval, especially on structured, text-rich documents.
Contribution
Introduces ArXivDoc, a LaTeX source-based benchmark for scientific retrieval, and systematically compares different representation modalities, highlighting the limitations of document-as-image approaches.
Findings
Document-as-image representations are consistently suboptimal, especially for longer documents.
Text-based representations outperform image-based methods even for figure queries.
Interleaved text+image representations outperform pure document-as-image approaches.
Abstract
Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We…
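To illustrate why LaTeX source gives direct access to structured elements, here is a minimal Python sketch that pulls figure, table, and equation environments out of a LaTeX string. It is not the authors' pipeline, which this summary does not describe. The function name `extract_elements`, the regular expressions, and the output fields are all assumptions made for illustration. The caption pattern does not handle nested braces.

```python
import re

# Hypothetical patterns: illustrative only, not taken from the paper.
# Matches \begin{figure}...\end{figure} (and table/equation, starred or not).
ENV_PATTERN = re.compile(
    r"\\begin\{(figure|table|equation)\*?\}(.*?)\\end\{\1\*?\}",
    re.DOTALL,
)
# Simplifying assumption: captions contain no nested braces.
CAPTION_PATTERN = re.compile(r"\\caption\{([^{}]*)\}")


def extract_elements(latex_source):
    """Return one dict per figure/table/equation environment found."""
    elements = []
    for match in ENV_PATTERN.finditer(latex_source):
        kind, body = match.group(1), match.group(2)
        caption = CAPTION_PATTERN.search(body)
        elements.append({
            "type": kind,
            "caption": caption.group(1).strip() if caption else None,
            "body": body.strip(),
        })
    return elements
```

Each extracted element carries its type, so a query could be tied to one specific kind of evidence, for example a question answerable only from a given table. A rendered page image offers no comparable handle on where a table or figure begins and ends.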
