DRISHTIKON: Visual Grounding at Multiple Granularities in Documents
Badri Vishal Kasuba, Parag Chaudhuri, Ganesh Ramakrishnan

TL;DR
DRISHTIKON is a multi-granular visual grounding framework for complex, multilingual documents. It aims to improve interpretability and accuracy in document understanding and VQA by combining a region matching algorithm with a new benchmark (MGVG).
Contribution
The paper presents a novel multi-granular visual grounding approach combining OCR, language models, and region matching, along with a new benchmark dataset for evaluation.
Findings
Achieves state-of-the-art grounding accuracy.
Line-level granularity offers the best balance between precision and recall.
Multi-block and multi-line reasoning improve performance.
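To illustrate why granularity trades precision against recall, here is a minimal sketch that scores a predicted box against a ground-truth answer box by area overlap. The metric definitions, the function names, and the example boxes are all assumptions made for illustration; this summary does not state how the paper computes its scores.

```python
# Hypothetical area-based precision/recall for one predicted box.
# Boxes are (x0, y0, x1, y1) in pixels; this is NOT the paper's stated metric.

def box_area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def overlap_area(a, b):
    # Intersection rectangle; box_area clamps to zero if the boxes are disjoint.
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(inter)

def precision_recall(pred, truth):
    inter = overlap_area(pred, truth)
    precision = inter / box_area(pred) if box_area(pred) else 0.0
    recall = inter / box_area(truth) if box_area(truth) else 0.0
    return precision, recall

truth_box = (10, 40, 110, 60)   # answer spans one full line
block_box = (0, 0, 200, 100)    # coarse prediction: the whole block
word_box = (10, 40, 60, 60)     # fine prediction: only half the line

print(precision_recall(block_box, truth_box))  # (0.1, 1.0)
print(precision_recall(word_box, truth_box))   # (1.0, 0.5)
print(precision_recall(truth_box, truth_box))  # (1.0, 1.0)
```

In this toy setup the block-level box covers the whole answer but is mostly irrelevant area, while the word-level box is fully relevant but misses half the answer. That is one plausible reading of why an intermediate granularity can balance the two.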
Abstract
Visual grounding in text-rich document images is a critical yet underexplored challenge for Document Intelligence and Visual Question Answering (VQA) systems. We present DRISHTIKON, a multi-granular and multi-block visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans at the block, line, word, and point levels. We introduce the Multi-Granular Visual Grounding (MGVG) benchmark, a curated test set of diverse circular notifications from various sectors, each manually annotated with fine-grained, human-verified labels across multiple granularities. Extensive experiments show that our method achieves state-of-the-art grounding accuracy, with line-level granularity providing the best balance between…
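The idea of matching an answer string to OCR regions can be sketched at the line level as follows. This is a simplified illustration, not the paper's region matching algorithm: the `OcrLine` type, the fuzzy-matching approach with the standard library's `difflib`, the 0.6 threshold, and the sample text are all assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple

@dataclass
class OcrLine:
    # Hypothetical OCR output record: recognized text plus its bounding box.
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1)

def match_answer_to_lines(answer: str, lines: List[OcrLine],
                          threshold: float = 0.6) -> List[OcrLine]:
    """Return OCR lines whose text fuzzily matches the answer, best match first."""
    scored = []
    for line in lines:
        score = SequenceMatcher(None, answer.lower(), line.text.lower()).ratio()
        if score >= threshold:
            scored.append((score, line))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [line for _, line in scored]

def union_box(lines: List[OcrLine]) -> Tuple[int, int, int, int]:
    """Smallest box enclosing all given lines (a coarser, block-like region)."""
    return (min(l.box[0] for l in lines), min(l.box[1] for l in lines),
            max(l.box[2] for l in lines), max(l.box[3] for l in lines))

ocr_lines = [
    OcrLine("Date of issue: 12 March 2021", (10, 10, 200, 30)),
    OcrLine("Subject: Revised leave policy", (10, 40, 220, 60)),
]
hits = match_answer_to_lines("Revised leave policy", ocr_lines)
print([h.box for h in hits])  # [(10, 40, 220, 60)]
```

A word-level or point-level variant would need per-word boxes from the OCR engine; `union_box` shows how finer regions could be merged into a coarser one.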
Taxonomy
Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Multimodal Machine Learning Applications
