AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings

Haoxuan Li; Wei Song; Aofan Liu; Peiwu Qin

arXiv:2508.13606 · cs.CL · August 20, 2025

TL;DR

AdaDocVQA is an adaptive framework for long document visual question answering in low-resource settings. It combines hybrid retrieval, data augmentation, and dynamic inference, and reports state-of-the-art results on Japanese benchmarks.

Contribution

The paper presents a novel unified framework combining document segmentation, automated data augmentation, and adaptive inference for improved low-resource document VQA.

Findings

01

Achieved 83.04% accuracy on Yes/No questions in JDocQA.

02

Improved factual question accuracy to 52.66%.

03

Established new state-of-the-art results for Japanese document VQA.

Abstract

Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements, with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on the LAVA dataset. Ablation studies…
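
The abstract names adaptive ensemble inference with early stopping but does not spell out the procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation: several inference configurations are tried in turn, their answers are tallied by majority vote, and the loop stops as soon as one answer holds a large enough share of the votes. The function name, the `min_votes` and `agreement` parameters, and the shape of a configuration dictionary are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def ensemble_with_early_stopping(
    configs: Iterable[dict],
    answer_fn: Callable[[dict], str],
    min_votes: int = 3,
    agreement: float = 0.6,
) -> Tuple[str, int]:
    """Majority-vote ensemble that stops once one answer dominates.

    configs    -- hypothetical inference configurations (e.g. decoding settings)
    answer_fn  -- hypothetical callable that runs the model under one config
    min_votes  -- assumed minimum number of runs before stopping is allowed
    agreement  -- assumed vote share the leading answer must reach to stop
    Returns the winning answer and the number of configurations actually run.
    """
    votes: Counter = Counter()
    used = 0
    for cfg in configs:
        votes[answer_fn(cfg)] += 1
        used += 1
        top_answer, top_count = votes.most_common(1)[0]
        # Stop early when enough runs agree on the same answer.
        if used >= min_votes and top_count / used >= agreement:
            return top_answer, used
    if not votes:
        raise ValueError("no configurations supplied")
    # No early stop: fall back to the plain majority over all runs.
    return votes.most_common(1)[0][0], used
```

Under these assumptions, a question on which the first three configurations agree costs three model calls instead of the full ensemble, while a contested question falls back to a plain majority vote over every configuration.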
