Analyzing the Efficacy of an LLM-Only Approach for Image-based Document Question Answering
Nidhi Hegde, Sujoy Paul, Gagan Madan, Gaurav Aggarwal

TL;DR
This paper investigates whether an instruction-tuned large language model alone can effectively perform image-based document question answering by directly processing serialized textual information, potentially replacing traditional vision encoders.
Contribution
It introduces an LLM-only approach for document question answering, bypassing vision encoders, and provides a comprehensive quantitative analysis across multiple datasets.
Findings
An LLM-only approach achieves performance comparable to state-of-the-art methods.
Serializing the textual information in a document image and passing it to the LLM is effective for document QA.
The approach simplifies the model architecture while maintaining accuracy.
Abstract
Recent document question answering models consist of two key components: a vision encoder, which captures layout and visual elements in images, and a Large Language Model (LLM), which helps contextualize questions to the image and supplements them with external world knowledge to generate accurate answers. However, the relative contributions of the vision encoder and the language model in these tasks remain unclear. This is especially interesting given the effectiveness of instruction-tuned LLMs, which exhibit remarkable adaptability to new tasks. To this end, we explore the following aspects in this work: (1) the efficacy of an LLM-only approach on document question answering tasks; (2) strategies for serializing textual information within document images and feeding it directly to an instruction-tuned LLM, thus bypassing the need for an explicit vision encoder; (3) thorough quantitative…
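The serialization idea in the abstract can be illustrated with a minimal sketch. It assumes OCR output is already available as words with positions, groups them into visual lines, reads each line left to right, and wraps the result in a prompt. The OcrWord type, the line_tol threshold, and the prompt template are hypothetical choices made for illustration; this summary does not specify which serialization strategies the authors actually use.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """Hypothetical OCR token: text plus top-left position and height."""
    text: str
    x: float  # left edge
    y: float  # top edge
    h: float  # box height


def serialize_reading_order(words, line_tol=0.5):
    """Group words into visual lines, then read each line left to right.

    A word joins the current line when its top edge is within
    line_tol * height of the first word on that line (assumed heuristic).
    """
    lines = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(w.y - lines[-1][0].y) <= line_tol * lines[-1][0].h:
            lines[-1].append(w)
        else:
            lines.append([w])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


def build_prompt(words, question):
    """Wrap the serialized document text and the question in a plain prompt."""
    doc = serialize_reading_order(words)
    return f"Document:\n{doc}\n\nQuestion: {question}\nAnswer:"
```

The resulting string would be passed to an instruction-tuned LLM in place of image features. A simple top-to-bottom, left-to-right ordering like this loses multi-column and table structure, which is one reason to compare several serialization strategies.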
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
