PDFTriage: Question Answering over Long, Structured Documents

Jon Saad-Falcon; Joe Barrow; Alexa Siu; Ani Nenkova; David Seunghyun Yoon; Ryan A. Rossi; Franck Dernoncourt

arXiv:2309.08872 · cs.CL · November 9, 2023 · 2 cites

TL;DR

PDFTriage introduces a novel method for question answering over structured documents like PDFs, leveraging document structure for improved retrieval and understanding, outperforming existing retrieval-augmented LLMs.

Contribution

The paper presents PDFTriage, a new approach that incorporates document structure into retrieval for QA, along with a new benchmark dataset of 900+ questions over 80 documents.

Findings

1. PDFTriage improves QA accuracy on structured documents.

2. Structured retrieval outperforms plain text retrieval in experiments.

3. Benchmark dataset enables future research in document QA.

Abstract

Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To overcome this issue, most existing works focus on retrieving the relevant context from the document, representing them as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve the context based on either…
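
The gap the abstract describes, between a flattened plain-text view and the structure a reader has in mind, can be illustrated with a small sketch. This is not the paper's implementation: the abstract above is truncated before it describes the retrieval mechanism, and every name here (`Document`, `Section`, `Table`, `sections_on_page`, `tables_on_page`, `as_plain_text`) is a hypothetical chosen for illustration. It only shows why a question such as "what is the table on page 2 about?" is easy to answer against structured metadata and hard to answer once the document has been flattened.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical data model, not taken from the paper.
@dataclass
class Section:
    title: str
    page: int
    text: str


@dataclass
class Table:
    caption: str
    page: int
    rows: list[list[str]]


@dataclass
class Document:
    sections: list[Section] = field(default_factory=list)
    tables: list[Table] = field(default_factory=list)

    def as_plain_text(self) -> str:
        """Flatten to text only; page numbers, titles, and captions are lost."""
        parts = [s.text for s in self.sections]
        parts += [" ".join(" ".join(row) for row in t.rows) for t in self.tables]
        return "\n".join(parts)

    def sections_on_page(self, page: int) -> list[Section]:
        """Structure-aware lookup: sections that start on the given page."""
        return [s for s in self.sections if s.page == page]

    def tables_on_page(self, page: int) -> list[Table]:
        """Structure-aware lookup: tables located on the given page."""
        return [t for t in self.tables if t.page == page]


# Toy document with made-up content, for illustration only.
doc = Document(
    sections=[
        Section("Introduction", 1, "We study long documents."),
        Section("Results", 2, "Accuracy improves with structure."),
    ],
    tables=[Table("Scores by system", 2, [["system", "score"], ["A", "0.7"]])],
)

# A page-scoped question resolves directly against the structure.
print([t.caption for t in doc.tables_on_page(2)])

# The flattened view keeps the cell values but not where they came from.
print(doc.as_plain_text())
```

In this toy model, the flattened string still contains the table cells but no longer records which page they were on or that they formed a table, which is the kind of incongruity the abstract points to.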

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: fail · Focus