Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu

TL;DR
This paper introduces Deferred Visual Ingestion (DVI), a novel framework for visual-dense document question answering that improves accuracy by deferring visual analysis to inference time and leveraging document structure for indexing.
Contribution
The paper proposes DVI, which makes no VLM calls during preprocessing, builds a hierarchical index from document structure, and performs targeted visual analysis at inference time. It reports substantially higher QA accuracy on engineering documents.
Findings
DVI achieves 65.6% accuracy on the Bridge engineering drawings, versus 24.3% for PI.
On the Steel catalog, DVI reaches 30.6% accuracy, versus 16.1% for the compared baseline.
On CircuitVQA, DVI's retrieval ImgR@3 is 31.2%, versus 0.7% for the compared baseline.
Abstract
Existing multimodal document question answering methods predominantly adopt a Pre-Ingestion (PI) strategy: during the indexing phase, a Vision Language Model (VLM) is called on every page to generate page descriptions that are then encoded into vectors, and questions are answered via embedding similarity retrieval. However, this approach faces a dual dilemma on visual-dense engineering documents: VLM blind descriptions inevitably lose critical visual details, and embedding retrieval systematically fails on highly similar documents. This paper proposes the Deferred Visual Ingestion (DVI) framework: zero VLM calls during preprocessing, leveraging only document structural information (table of contents, drawing numbers) to automatically build a hierarchical index through the HDNC (Hierarchical Drawing Number Clustering) algorithm; during inference, candidate pages are located via BM25…
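The abstract is cut off before it says what text BM25 is run over, so the following is only a minimal sketch of the retrieval step it names, not the paper's implementation. It assumes each page is represented by a short structural string (for example a table-of-contents title plus a drawing number); the function name `bm25_rank`, the whitespace tokenizer, and the parameter defaults `k1=1.5` and `b=0.75` are illustrative choices, not values taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # likely handle drawing numbers and punctuation more carefully.
    return text.lower().split()


def bm25_rank(query, pages, k1=1.5, b=0.75):
    """Rank pages by Okapi BM25 score against the query.

    Returns a list of (page_index, score) pairs, best match first.
    """
    docs = [tokenize(p) for p in pages]
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n

    # Document frequency: number of pages containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = tokenize(query)
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((i, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical page strings, invented for illustration only.
pages = [
    "bridge pier reinforcement drawing S-101",
    "steel beam catalog section sizes",
    "bridge deck layout drawing S-102",
]
candidates = bm25_rank("pier reinforcement", pages)
```

In a deferred-ingestion setup, only the top few entries of `candidates` would be handed to a VLM at question time, which is what keeps preprocessing free of VLM calls.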
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
