Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents
Davide Napolitano, Luca Cagliero, Fabrizio Battiloro

TL;DR
This paper introduces VRD-UQA, a benchmark for assessing the robustness of Visual Large Language Models in identifying unanswerable questions in visually rich documents, highlighting current limitations and guiding future improvements.
Contribution
The paper presents VRD-UQA, a novel benchmark for evaluating VLLMs' resilience to unanswerable questions in VRDs, including a systematic analysis of model performance and corruption effects.
Findings
VLLMs show limited accuracy in detecting unanswerable questions.
Different corruption types affect model performance to different degrees.
Knowledge injection strategies influence unanswerability detection effectiveness.
Abstract
The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research delves into the robustness of the VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION…
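The corruption step described in the abstract (replacing an entity with another of the same type taken from a different document element, layout position, or page) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Entity` fields, the `corrupt_question` helper, and the example entities are all hypothetical, and entity extraction is assumed to have happened upstream. A swap of this kind only makes a question a candidate for unanswerability; a real pipeline would presumably still need to verify that the corrupted question has no answer in the document.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical record for an entity extracted from a document."""
    text: str     # surface form, e.g. "2019"
    kind: str     # entity type, e.g. "DATE" or "ORG"
    element: str  # document element it came from, e.g. "table"
    page: int     # page number where it appears


def corrupt_question(question, target, candidates, rng=None):
    """Swap `target` in `question` for a same-type entity found elsewhere.

    Returns (corrupted_question, replacement), or None when no suitable
    replacement exists or the target text is not in the question.
    """
    rng = rng or random.Random()
    pool = [
        e for e in candidates
        if e.kind == target.kind                      # same entity type
        and e.text != target.text                     # a different entity
        and (e.element != target.element or e.page != target.page)
    ]
    if not pool or target.text not in question:
        return None
    replacement = rng.choice(pool)
    # Replace only the first occurrence of the target's surface form.
    return question.replace(target.text, replacement.text, 1), replacement


if __name__ == "__main__":
    target = Entity("2019", "DATE", "table", 1)
    candidates = [
        target,
        Entity("2021", "DATE", "paragraph", 3),  # same type, other element and page
        Entity("Acme", "ORG", "paragraph", 3),   # wrong type, filtered out
    ]
    print(corrupt_question("What was the revenue in 2019?", target, candidates))
```

Filtering on element or page mirrors the abstract's description of drawing replacements from different document elements and different layout positions or pages; a stricter variant could require both to differ.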
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Data Visualization and Analytics
