DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko, Jihyung Kil, Hyunwoo J. Kim

TL;DR
DocPrune is a training-free token pruning method that enhances the efficiency of document question answering models by removing irrelevant tokens, leading to significant speedups and improved accuracy.
Contribution
It introduces a novel, training-free, progressive token pruning framework tailored for long-document understanding that leverages document structure for efficiency.
Findings
Increases throughput by over 3x in both the encoder and the decoder.
Improves F1 score by 1.0 without additional training.
Effectively removes background and irrelevant tokens in documents.
Abstract
Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant…
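The abstract is cut off before it describes the actual pruning criteria, so the following is only a minimal sketch of the general idea of two-stage token pruning, not DocPrune's method. It assumes, purely for illustration, that background patches can be detected by low pixel variance and that question relevance can be scored by cosine similarity between token and question embeddings. The function names (`prune_background`, `prune_by_question`), the thresholds, and the toy data are all hypothetical.

```python
from statistics import pvariance


def prune_background(patches, var_threshold=1e-3):
    """Return indices of patches whose pixel variance exceeds the threshold.

    Assumption: near-uniform patches (e.g. blank page margins) are background.
    """
    return [i for i, p in enumerate(patches) if pvariance(p) > var_threshold]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_by_question(token_embs, question_emb, keep_indices, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of surviving tokens by question similarity.

    Assumption: relevance is approximated by cosine similarity to one question vector.
    """
    ranked = sorted(
        keep_indices,
        key=lambda i: cosine(token_embs[i], question_emb),
        reverse=True,
    )
    k = max(1, int(len(ranked) * keep_ratio))
    return sorted(ranked[:k])  # restore original token order


# Toy data: four flattened patches and their (hypothetical) 2-d embeddings.
patches = [
    [1.0, 1.0, 1.0, 1.0],  # blank background
    [0.0, 1.0, 0.0, 1.0],  # textured content
    [1.0, 1.0, 1.0, 1.0],  # blank background
    [0.2, 0.9, 0.1, 0.8],  # textured content
]
token_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
question_emb = [0.0, 1.0]

foreground = prune_background(patches)
selected = prune_by_question(token_embs, question_emb, foreground)
print(foreground, selected)
```

Running the two stages in sequence means the similarity scoring only touches tokens that survived the cheaper background filter, which is one plausible reading of "progressive" pruning.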
