DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

Ahmad Mohammadshirazi, Pinaki Prasad Guha Neogi, Ser-Nam Lim, and Rajiv Ramnath

arXiv:2412.00151 · cs.CV · July 11, 2025

TL;DR

DLaVA introduces a zero-shot, OCR-free document VQA pipeline leveraging multimodal large language models, enhancing interpretability, trustworthiness, and efficiency in answer localization within complex document layouts.

Contribution

DLaVA presents a novel OCR-free, training-free approach that uses MLLMs for zero-shot answer localization, reducing complexity and improving interpretability in document VQA.

Findings

01 Competitive performance on benchmark datasets

02 Lower computational complexity than state-of-the-art methods

03 Enhanced trustworthiness and interpretability

Abstract

Document Visual Question Answering (VQA) demands robust integration of text detection, recognition, and spatial reasoning to interpret complex document layouts. In this work, we introduce DLaVA, a novel, training-free pipeline that leverages Multimodal Large Language Models (MLLMs) for zero-shot answer localization in order to improve trustworthiness, interpretability, and explainability. By leveraging an innovative OCR-free approach that organizes text regions with unique bounding box IDs, the proposed method preserves spatial contexts without relying on iterative OCR or chain-of-thought reasoning, thus substantially reducing the computational complexity. We further enhance the evaluation protocol by integrating Intersection over Union (IoU) metrics alongside Average Normalized Levenshtein Similarity (ANLS), thereby ensuring that not only textual accuracy is considered, but spatial…
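The abstract names two evaluation metrics, IoU for spatial accuracy and ANLS for textual accuracy. A minimal Python sketch of both follows, to make the definitions concrete. It is not the authors' evaluation code: the `(x_min, y_min, x_max, y_max)` box format, the function names, the lowercase-and-strip normalization, and the threshold `tau = 0.5` are assumptions that follow common convention for these metrics, not details taken from this page.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def levenshtein(s: str, t: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    if len(s) < len(t):
        s, t = t, s
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def nls(pred: str, truth: str, tau: float = 0.5) -> float:
    """Normalized Levenshtein similarity; 0 when the normalized distance reaches tau."""
    p, g = pred.strip().lower(), truth.strip().lower()  # assumed normalization
    longest = max(len(p), len(g))
    if longest == 0:
        return 1.0
    nl = levenshtein(p, g) / longest
    return 1.0 - nl if nl < tau else 0.0


def anls(preds: Sequence[str], truths: Sequence[Sequence[str]], tau: float = 0.5) -> float:
    """Average, over questions, of the best similarity against any accepted answer."""
    if not preds:
        return 0.0
    scores = [max((nls(p, g, tau) for g in gts), default=0.0)
              for p, gts in zip(preds, truths)]
    return sum(scores) / len(scores)
```

Under these assumptions, a prediction is scored on two axes: `anls` rewards answers whose text is close to a reference string, and `iou` rewards predicted boxes that overlap the annotated answer region, so a textually correct answer drawn from the wrong place in the document can still be penalized.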

Code & Models

Repositories

ahmad-shirazi/AnnotMLLM (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Focus