Analyzing the Efficacy of an LLM-Only Approach for Image-based Document Question Answering

Nidhi Hegde; Sujoy Paul; Gagan Madan; Gaurav Aggarwal

arXiv:2309.14389 · cs.CV · September 27, 2023

TL;DR

This paper investigates whether an instruction-tuned large language model alone can effectively perform image-based document question answering by directly processing serialized textual information, potentially replacing traditional vision encoders.

Contribution

It introduces an LLM-only approach for document question answering, bypassing vision encoders, and provides a comprehensive quantitative analysis across multiple datasets.

Findings

01

The LLM-only approach achieves performance comparable to state-of-the-art methods.

02

Serializing a document's textual information as LLM input is effective for document QA.

03

The approach simplifies the model architecture while maintaining accuracy.

Abstract

Recent document question answering models consist of two key components: a vision encoder, which captures layout and visual elements in images, and a Large Language Model (LLM), which helps contextualize questions to the image and supplements them with external world knowledge to generate accurate answers. However, the relative contributions of the vision encoder and the language model in these tasks remain unclear. This is especially interesting given the effectiveness of instruction-tuned LLMs, which exhibit remarkable adaptability to new tasks. To this end, we explore the following aspects in this work: (1) the efficacy of an LLM-only approach on document question answering tasks; (2) strategies for serializing textual information within document images and feeding it directly to an instruction-tuned LLM, thus bypassing the need for an explicit vision encoder; (3) thorough quantitative…
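The serialization idea in aspect (2) can be illustrated with a minimal sketch. It assumes OCR output is already available as words with bounding boxes, and it orders them top to bottom and left to right before placing the text in a prompt. The `OcrWord` type, the `line_tolerance` parameter, the prompt wording, and both function names are hypothetical choices made for illustration; they are not taken from the paper, which may serialize text differently.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR token with a simplified bounding box (hypothetical type)."""
    text: str
    x: float       # left edge of the box
    y: float       # top edge of the box
    height: float  # box height, used to decide whether words share a line


def serialize_reading_order(words, line_tolerance=0.5):
    """Group words into visual lines by vertical position, then read left to right.

    line_tolerance is an assumed heuristic: a word joins the current line when
    its top edge is within this fraction of the line's first word height.
    """
    lines = []  # each entry is a list of OcrWord on the same visual line
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines:
            anchor = lines[-1][0]
            if abs(word.y - anchor.y) <= line_tolerance * anchor.height:
                lines[-1].append(word)
                continue
        lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


def build_prompt(document_text, question):
    """Wrap the serialized text and the question in an illustrative instruction prompt."""
    return (
        "Answer the question using only the document text below.\n\n"
        f"Document:\n{document_text}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy OCR output, deliberately out of reading order.
words = [
    OcrWord("Total:", 10, 52, 10),
    OcrWord("Invoice", 10, 10, 10),
    OcrWord("$42.00", 80, 50, 10),
    OcrWord("#1234", 70, 11, 10),
]
text = serialize_reading_order(words)
prompt = build_prompt(text, "What is the total?")
print(prompt)
```

The resulting prompt string would then be passed to whichever instruction-tuned LLM is under evaluation; no vision encoder is involved, so layout survives only through the order and line breaks of the serialized text.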

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning