PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Xudong Xie, Hao Yan, Liang Yin, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai

TL;DR
PDF-WuKong is a multimodal large language model designed for efficient long PDF reading, utilizing sparse sampling to improve question-answering on lengthy academic papers with text and images.
Contribution
Introduces PDF-WuKong, a novel multimodal model with a sparse sampler for improved long PDF comprehension and QA, supported by a new dataset PaperPDF with 1.1 million QA pairs.
Findings
Outperforms existing models on long multimodal document understanding, with a reported average gain of 8.6% in F1 over proprietary products.
Uses a sparse sampler to select relevant text and images, enhancing efficiency.
Constructs the PaperPDF dataset, with 1.1 million QA pairs, for training and evaluation.
Abstract
Multimodal document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially for academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) that is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler selects the paragraphs or diagrams most pertinent to user queries. To effectively train and evaluate our…
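The selection step described in the abstract, picking the paragraphs or diagrams most pertinent to a query, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's sampler is trained end-to-end, whereas the sketch below simply ranks precomputed embeddings by cosine similarity and keeps the top k. The function names (`cosine`, `sparse_sample`), the toy vectors, and the `top_k` parameter are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sparse_sample(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the top_k chunks most similar to the query.

    Assumption: text paragraphs and images have already been embedded into
    the same vector space as the query; how that is done is out of scope here.
    """
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, hand-written embeddings standing in for encoder outputs (hypothetical).
chunks = ["intro paragraph", "figure 3: accuracy plot", "related work paragraph"]
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2], [0.7, 0.1, 0.0]]
query_vec = [0.1, 0.9, 0.1]

selected = sparse_sample(query_vec, chunk_vecs, top_k=2)
selected_chunks = [chunks[i] for i, _ in selected]
print(selected_chunks)
```

Only the selected chunks would then be passed to the language model, which is where the efficiency gain on long documents comes from: the model's input grows with `top_k` rather than with the length of the PDF.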
