ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

Ahmad Mohammadshirazi; Pinaki Prasad Guha Neogi; Dheeraj Kulshrestha; Rajiv Ramnath

arXiv:2511.18192 · cs.CV · December 1, 2025

TL;DR

The paper introduces ARIAL, a modular framework that combines LLM planning with specialized tools to improve both accuracy and spatial localization in Document VQA, enhancing interpretability and trustworthiness.

Contribution

ARIAL is a novel agentic framework that orchestrates multiple tools for precise answer localization and interpretability in Document VQA tasks.

Findings

01. Achieves state-of-the-art results on four benchmarks.

02. Improves both textual accuracy and spatial grounding.

03. Provides transparent reasoning traces for auditability.

Abstract

Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region…
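The localization step named at the end of the abstract can be illustrated with a minimal sketch. The abstract is cut off before it describes the actual method, so everything below is an assumption rather than ARIAL's implementation: it supposes that the OCR stage yields word-level boxes in reading order, and it uses fuzzy string matching from Python's standard `difflib` to find the word span closest to the generated answer. The names `OcrWord`, `union_box`, `localize_answer`, and the `min_score` threshold are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class OcrWord:
    """One OCR token (hypothetical structure, not taken from the paper)."""
    text: str
    box: tuple  # (x0, y0, x1, y1) in pixels


def union_box(boxes):
    """Smallest axis-aligned box that covers all the given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def localize_answer(answer, words, min_score=0.8):
    """Return (box, score) for the word span best matching `answer`, or None.

    Assumption: `words` is in reading order, so an answer corresponds to a
    contiguous run of OCR words.
    """
    target = answer.lower().strip()
    n_target = max(1, len(target.split()))
    best = None
    # Try windows one word shorter, equal to, and one word longer than the answer.
    for size in range(max(1, n_target - 1), n_target + 2):
        for start in range(0, len(words) - size + 1):
            span = words[start:start + size]
            candidate = " ".join(w.text for w in span).lower()
            score = SequenceMatcher(None, target, candidate).ratio()
            if best is None or score > best[1]:
                best = (union_box([w.box for w in span]), score)
    # Reject weak matches instead of returning an unreliable region.
    if best is None or best[1] < min_score:
        return None
    return best


# Toy OCR output for a one-line invoice (made-up coordinates).
words = [
    OcrWord("Invoice", (10, 10, 80, 30)),
    OcrWord("Total:", (10, 40, 60, 60)),
    OcrWord("$1,250.00", (70, 40, 150, 60)),
]
result = localize_answer("$1,250.00", words)
missing = localize_answer("nonexistent phrase here", words)
```

In this toy case `result` holds the box of the single matching word, and `missing` is `None` because no span clears the threshold. Returning `None` on a weak match, rather than the nearest box, is a design choice in keeping with the abstract's emphasis on reliable grounding.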

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)