Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
Wenjin Wang, Yunhao Li, Yixin Ou, Yin Zhang

TL;DR
This paper introduces LATIN-Prompt and LATIN-Tuning, methods that enable off-the-shelf instruction models to understand document layouts and improve zero-shot document question answering performance.
Contribution
It proposes a novel layout and task-aware prompting and tuning approach that allows existing instruction models to excel in document image question answering without extensive pre-training.
Findings
LATIN-Prompt makes Claude and ChatGPT comparable to fine-tuned SOTA models.
LATIN-Tuning significantly boosts Alpaca's zero-shot performance on DocVQA.
The methods demonstrate strong quantitative and qualitative improvements.
Abstract
Layout-aware pre-trained models have achieved significant progress on document image question answering. They introduce extra learnable modules into existing language models to capture layout information within document images from text bounding box coordinates obtained by OCR tools. However, extra modules necessitate pre-training on extensive document images. This prevents these methods from directly utilizing off-the-shelf instruction-tuning language foundation models, which have recently shown promising potential in zero-shot learning. Instead, in this paper, we find that instruction-tuning language models like Claude and ChatGPT can understand layout by spaces and line breaks. Based on this observation, we propose the LAyout and Task aware Instruction Prompt (LATIN-Prompt), which consists of layout-aware document content and task-aware instruction. Specifically, the former uses…
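Example (illustrative sketch, not the authors' code)
The abstract's core observation is that layout can be conveyed to a language model through spaces and line breaks alone. The following Python sketch shows one way that idea could work: OCR words are grouped into rows by vertical position, then each word is padded with spaces so it starts near a column proportional to its x coordinate. The `OcrWord` type, the `char_width` and `y_tolerance` parameters, and the instruction wording in `build_prompt` are all assumptions made for illustration; the paper's actual procedure and prompt text may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrWord:
    """Hypothetical OCR output: a word and its bounding box in pixels."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(words: List[OcrWord], y_tolerance: float = 5.0) -> List[List[OcrWord]]:
    """Group words into rows whose vertical centers lie within y_tolerance."""
    lines = []
    for w in sorted(words, key=lambda w: ((w.y0 + w.y1) / 2, w.x0)):
        cy = (w.y0 + w.y1) / 2
        if lines and abs(lines[-1]["cy"] - cy) <= y_tolerance:
            lines[-1]["words"].append(w)
        else:
            lines.append({"cy": cy, "words": [w]})
    # Within each row, read left to right.
    return [sorted(line["words"], key=lambda w: w.x0) for line in lines]


def render_layout(words: List[OcrWord], char_width: float = 10.0,
                  y_tolerance: float = 5.0) -> str:
    """Render words as plain text, using spaces and line breaks to mimic layout.

    char_width is an assumed average glyph width in pixels; it maps an
    x coordinate to a character column.
    """
    rows = []
    for line in group_into_lines(words, y_tolerance):
        row = ""
        for w in line:
            col = int(w.x0 / char_width)
            # Pad up to the target column; keep at least one space between words.
            pad = max(1, col - len(row)) if row else col
            row += " " * pad + w.text
        rows.append(row)
    return "\n".join(rows)


def build_prompt(words: List[OcrWord], question: str) -> str:
    """Combine layout-preserving text with an instruction (wording is illustrative)."""
    document = render_layout(words)
    return (
        "You are asked to answer a question about the document below.\n"
        "The document text keeps its layout via spaces and line breaks.\n"
        "Answer with a short span copied from the document.\n\n"
        f"Document:\n{document}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

For instance, two key-value rows whose values share an x coordinate come out aligned in the same text column, so a model reading the plain text can still see the table-like structure. The resulting string could then be sent to any instruction-following model without layout-specific pre-training.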
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Handwritten Text Recognition Techniques
Methods: ALIGN · Attentive Walk-Aggregating Graph Neural Network
