PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

Xudong Xie; Hao Yan; Liang Yin; Yang Liu; Jing Ding; Minghui Liao; Yuliang Liu; Wei Chen; Xiang Bai

arXiv:2410.05970·cs.CV·April 28, 2026

TL;DR

PDF-WuKong is a multimodal large language model designed for efficient long PDF reading, utilizing sparse sampling to improve question-answering on lengthy academic papers with text and images.

Contribution

Introduces PDF-WuKong, a multimodal model with a sparse sampler for improved long-PDF comprehension and QA, supported by a new dataset, PaperPDF, with 1.1 million QA pairs.

Findings

01. Outperforms existing models on long multimodal document understanding by 8.6% F1.

02. Uses a sparse sampler to select the text and images relevant to a query, improving efficiency.

03. Constructs the PaperPDF dataset, with extensive QA pairs for training and evaluation.
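The selection step described in the findings can be illustrated with a minimal sketch. It assumes each paragraph or figure has already been embedded as a vector, and it ranks chunks by cosine similarity to the query embedding before keeping the top k. The function names (`cosine`, `sparse_sample`), the toy three-dimensional embeddings, and the choice of cosine similarity are all assumptions made for illustration; the paper's sampler is trained end-to-end with the model and may differ in every one of these details.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sparse_sample(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k chunks most similar to the query, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical chunks and hand-written toy embeddings (not produced by any real encoder).
chunks = [
    "paragraph: method description",
    "figure: architecture diagram",
    "paragraph: acknowledgements",
]
chunk_vecs = [[0.9, 0.1, 0.0], [0.7, 0.6, 0.1], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.3, 0.0]

selected = sparse_sample(query_vec, chunk_vecs, k=2)
for idx, score in selected:
    print(chunks[idx], round(score, 3))
```

Only the selected chunks would then be passed to the language model, which is where the efficiency gain on long documents comes from: the input size depends on k, not on the length of the PDF.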

Abstract

Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. However, existing methods typically focus on either plain text or a limited number of document images, and they struggle to handle long PDF documents with interleaved text and images, especially academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler selects the paragraphs or diagrams most pertinent to user queries. To effectively train and evaluate our…

Code & Models

Datasets

yh0075/PaperPDF (dataset, 260 downloads)