Document Collection Visual Question Answering

Rubèn Tito; Dimosthenis Karatzas; Ernest Valveny

arXiv:2104.14336·cs.IR·April 4, 2023

TL;DR

This paper introduces Document Collection Visual Question Answering (DocCVQA), a new dataset and task in which questions are posed over a whole collection of document images. A system must both answer the question and retrieve the documents that contain the information needed to infer the answer.

Contribution

The paper presents a new dataset, task, evaluation metric, and baselines for document collection visual question answering, addressing the limitation that current methods process documents as single, isolated elements.

Findings

01

New dataset and task for document collection VQA

02

Proposed evaluation metric and baseline models

03

Insights into the challenges of multi-document understanding

Abstract

Current tasks and methods in Document Understanding aim to process documents as single elements. However, documents are usually organized in collections (historical records, purchase invoices) that provide context useful for their interpretation. To address this problem, we introduce Document Collection Visual Question Answering (DocCVQA), a new dataset and related task, where questions are posed over a whole collection of document images and the goal is not only to provide the answer to the given question, but also to retrieve the set of documents that contain the information needed to infer the answer. Along with the dataset, we propose a new evaluation metric and baselines, which provide further insights into the new dataset and task.
