PDFTriage: Question Answering over Long, Structured Documents
Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, David Seunghyun Yoon, Ryan A. Rossi, Franck Dernoncourt

TL;DR
PDFTriage introduces a novel method for question answering over structured documents like PDFs, leveraging document structure for improved retrieval and understanding, outperforming existing retrieval-augmented LLMs.
Contribution
The paper presents PDFTriage, a new approach that incorporates document structure into retrieval for QA, along with a new benchmark dataset of 900+ questions over 80 documents.
Findings
PDFTriage improves QA accuracy on structured documents.
Structure-aware retrieval outperforms plain-text retrieval in the paper's experiments.
The benchmark dataset enables future research in document QA.
Abstract
Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing them as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either…
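The contrast the abstract draws between plain-text and structure-aware context can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's implementation: the `Section` and `StructuredDocument` classes, the `fetch_pages`, `fetch_section`, and `as_plain_text` methods, and the sample document are all hypothetical names and data introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical unit of document structure (not taken from the paper)."""
    title: str
    page: int  # 1-indexed page on which the section starts
    text: str


@dataclass
class StructuredDocument:
    """Hypothetical container that keeps page and section boundaries."""
    sections: list[Section] = field(default_factory=list)

    def fetch_pages(self, pages: list[int]) -> str:
        """Return the text of every section that starts on one of the given pages."""
        wanted = set(pages)
        return "\n".join(s.text for s in self.sections if s.page in wanted)

    def fetch_section(self, title: str) -> str:
        """Return the text of the section with this title (case-insensitive)."""
        for s in self.sections:
            if s.title.lower() == title.lower():
                return s.text
        return ""  # no such section

    def as_plain_text(self) -> str:
        """Flatten the document, discarding page and section boundaries."""
        return " ".join(s.text for s in self.sections)


# Made-up sample document, for illustration only.
doc = StructuredDocument([
    Section("Introduction", 1, "This report covers quarterly results."),
    Section("Methods", 2, "We surveyed 40 teams."),
    Section("Results", 3, "Response rate was 85 percent."),
])
```

With the structure preserved, a question such as "What does page 2 say?" maps directly to `doc.fetch_pages([2])`. After `doc.as_plain_text()`, the page boundary is gone and the same question can no longer be answered by a simple lookup.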
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: fail · Focus
