Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Tao Xu

arXiv:2602.14162·cs.CL·February 27, 2026

Open Access

TL;DR

This paper introduces Deferred Visual Ingestion (DVI), a novel framework for visual-dense document question answering that improves accuracy by deferring visual analysis to inference time and leveraging document structure for indexing.

Contribution

The paper proposes DVI, which makes no VLM calls during preprocessing, builds a hierarchical index from document structure alone, and applies targeted visual analysis only at inference time, improving QA accuracy on engineering documents over Pre-Ingestion (PI).
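To make "hierarchical indexing based on document structure" concrete, here is a minimal sketch that groups drawing numbers under their shared prefix. It is not the paper's HDNC algorithm, whose details this page does not give. The function name `group_by_prefix`, the `-` separator, and the sample drawing numbers are all assumptions made for illustration.

```python
from collections import defaultdict


def group_by_prefix(drawing_numbers, sep="-"):
    """Group drawing numbers by everything before the last separator.

    Hypothetical illustration only: a one-level prefix grouping,
    not the HDNC algorithm described in the paper.
    """
    groups = defaultdict(list)
    for number in drawing_numbers:
        # rpartition splits at the LAST separator: "BR-01-002" -> "BR-01".
        prefix, _, _ = number.rpartition(sep)
        # A number with no separator becomes its own group.
        groups[prefix or number].append(number)
    return {prefix: sorted(members) for prefix, members in groups.items()}


# Made-up drawing numbers, for illustration only.
SAMPLE_NUMBERS = ["BR-01-002", "BR-01-001", "BR-02-001", "COVER"]
SAMPLE_INDEX = group_by_prefix(SAMPLE_NUMBERS)
```

The point of the sketch is that the index is built from text already in the document (drawing numbers), so no VLM call is needed at this stage.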

Findings

01

DVI achieves 65.6% accuracy on Bridge engineering drawings, outperforming PI's 24.3%.

02

On the Steel catalog, DVI reaches 30.6% accuracy, surpassing the PI baseline's 16.1%.

03

On CircuitVQA, DVI's retrieval ImgR@3 is 31.2%, far above the PI baseline's 0.7%.

Abstract

Existing multimodal document question answering methods predominantly adopt a Pre-Ingestion (PI) strategy: during the indexing phase, a Vision Language Model (VLM) is called on every page to generate page descriptions that are then encoded into vectors, and questions are answered via embedding similarity retrieval. However, this approach faces a dual dilemma on visual-dense engineering documents: VLM blind descriptions inevitably lose critical visual details, and embedding retrieval systematically fails on highly similar documents. This paper proposes the Deferred Visual Ingestion (DVI) framework: zero VLM calls during preprocessing, leveraging only document structural information (table of contents, drawing numbers) to automatically build a hierarchical index through the HDNC (Hierarchical Drawing Number Clustering) algorithm; during inference, candidate pages are located via BM25…
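The BM25 step mentioned at the end of the excerpt can be sketched as follows. This is a minimal implementation of the standard Okapi BM25 scoring formula over per-page structural text, not the paper's code. The page identifiers, page texts, tokenizer, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, pages, k1=1.5, b=0.75):
    """Return page ids sorted by Okapi BM25 score, best first.

    `pages` maps a page id to its structural text (for example a title
    block or table-of-contents entry). Hypothetical helper, not the
    paper's implementation.
    """
    if not pages:
        return []
    docs = {pid: tokenize(text) for pid, text in pages.items()}
    n = len(docs)
    # Average document length; fall back to 1.0 if every page is empty.
    avgdl = sum(len(toks) for toks in docs.values()) / n or 1.0
    # Document frequency: number of pages containing each term.
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))
    query_terms = tokenize(query)
    scores = {}
    for pid, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[pid] = score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up pages, for illustration only.
SAMPLE_PAGES = {
    "S-101": "S-101 pier cap reinforcement details",
    "S-102": "S-102 abutment general arrangement",
    "S-201": "S-201 deck slab reinforcement plan",
}
RANKED = bm25_rank("pier cap reinforcement", SAMPLE_PAGES)
```

In a deferred-ingestion setup, only the top-ranked candidate pages would then be passed to a VLM together with the question, so visual analysis is spent where the query points rather than on every page at indexing time.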

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques