ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization
Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Dheeraj Kulshrestha, Rajiv Ramnath

TL;DR
The paper introduces ARIAL, a modular framework that combines LLM planning with specialized tools to improve both accuracy and spatial localization in Document VQA, enhancing interpretability and trustworthiness.
Contribution
ARIAL is a novel agentic framework that orchestrates multiple tools for precise answer localization and interpretability in Document VQA tasks.
Findings
Achieves state-of-the-art results on four benchmarks.
Improves both textual accuracy and spatial grounding.
Provides transparent reasoning traces for auditability.
Abstract
Document Visual Question Answering (VQA) requires models to not only extract accurate textual answers but also precisely localize them within document images, a capability critical for interpretability in high-stakes applications. However, existing systems achieve strong textual accuracy while producing unreliable spatial grounding, or sacrifice performance for interpretability. We present ARIAL (Agentic Reasoning for Interpretable Answer Localization), a modular framework that orchestrates specialized tools through an LLM-based planning agent to achieve both precise answer extraction and reliable spatial grounding. ARIAL decomposes Document VQA into structured subtasks: OCR-based text extraction with TrOCR, retrieval-augmented context selection using semantic search, answer generation via a fine-tuned Gemma 3-27B model, and explicit bounding-box localization through text-to-region…
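The abstract's last listed subtask, mapping a textual answer back to a region of the page, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the OCR step yields words with pixel bounding boxes, and it swaps in simple exact matching on normalized tokens for whatever alignment ARIAL actually uses. The names `OcrWord`, `normalize`, `union`, and `localize_answer` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed box convention for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class OcrWord:
    """One OCR token with its bounding box (hypothetical structure)."""
    text: str
    box: Box


def normalize(s: str) -> str:
    """Lowercase and keep only alphanumeric characters for tolerant matching."""
    return "".join(ch for ch in s.lower() if ch.isalnum())


def union(boxes: List[Box]) -> Box:
    """Smallest box that encloses every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def localize_answer(answer: str, words: List[OcrWord]) -> Optional[Box]:
    """Return the enclosing box of the first run of consecutive OCR words
    whose normalized text equals the normalized answer tokens, else None."""
    target = [t for t in (normalize(tok) for tok in answer.split()) if t]
    if not target:
        return None
    norm = [normalize(w.text) for w in words]
    n = len(target)
    for i in range(len(words) - n + 1):
        if norm[i:i + n] == target:
            return union([w.box for w in words[i:i + n]])
    return None


if __name__ == "__main__":
    # Made-up OCR output for illustration only.
    page = [
        OcrWord("Invoice", (10, 10, 80, 30)),
        OcrWord("Total:", (10, 40, 60, 60)),
        OcrWord("$42.00", (70, 40, 130, 60)),
    ]
    print(localize_answer("Total: $42.00", page))
```

Exact token matching fails when the generated answer paraphrases the page text or when OCR misreads a character, so a practical system would likely need fuzzy or embedding-based alignment; the sketch only shows the shape of the text-to-region step.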
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
