PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering

Yihao Ding; Kaixuan Ren; Jiabin Huang; Siwen Luo; Soyeon Caren Han

arXiv:2404.12720·cs.CV·April 22, 2024

Open Access

TL;DR

This paper introduces PDF-MVQA, a new dataset and framework for multimodal information retrieval in multi-page, text-heavy research articles, addressing the challenge of understanding hierarchical semantic relations in visually-rich documents.

Contribution

The paper presents a comprehensive PDF Document VQA dataset and novel VRD-QA frameworks that capture textual content and layout relations across multiple pages.

Findings

01 The dataset enables examination of hierarchical layout structures in research articles.

02 The VRD-QA frameworks improve understanding of multi-page, multimodal documents.

03 The work aims to enhance how vision-and-language models handle text-dominant documents.

Abstract

Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers or visually rich document entities like tables and figures. Our contributions include the introduction of a comprehensive PDF Document VQA dataset, allowing the examination of semantically hierarchical layout…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
