DRISHTIKON: Visual Grounding at Multiple Granularities in Documents

Badri Vishal Kasuba; Parag Chaudhuri; Ganesh Ramakrishnan

arXiv:2506.21316 · cs.CV · July 17, 2025

TL;DR

DRISHTIKON is a multi-granular visual grounding framework for complex, multilingual documents. It combines multilingual OCR, large language models, and a region matching algorithm to localize answers at the block, line, word, and point levels, and it introduces the MGVG benchmark for evaluating grounding in document understanding and VQA tasks.

Contribution

The paper presents a novel multi-granular visual grounding approach combining OCR, language models, and region matching, along with a new benchmark dataset for evaluation.

Findings

01. Reports state-of-the-art grounding accuracy in the authors' experiments.

02. Line-level granularity offers the best balance of precision and recall.

03. Multi-block and multi-line reasoning improve performance.
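The findings refer to precision and recall of grounded regions. This page does not state how the paper computes them, so the following is only a minimal sketch of one common convention: match predicted boxes to annotated boxes by intersection over union (IoU) and count matches above a threshold. The function names, the `(x0, y0, x1, y1)` box format, and the 0.5 threshold are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    pred: Sequence[Box], gold: Sequence[Box], iou_threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedy one-to-one matching of predicted boxes to annotated boxes.

    A prediction counts as a true positive if its best still-unmatched
    annotated box overlaps it with IoU >= iou_threshold (an assumed value).
    """
    matched_gold = set()
    true_positives = 0
    for p in pred:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gold):
            if j in matched_gold:
                continue
            overlap = iou(p, g)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None and best_iou >= iou_threshold:
            matched_gold.add(best_j)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Under this convention, coarser regions such as blocks tend to cover the answer but include extra area, while finer regions such as words can miss parts of a multi-word answer, which is one way to read the reported trade-off across granularities.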

Abstract

Visual grounding in text-rich document images is a critical yet underexplored challenge for Document Intelligence and Visual Question Answering (VQA) systems. We present DRISHTIKON, a multi-granular and multi-block visual grounding framework designed to enhance interpretability and trust in VQA for complex, multilingual documents. Our approach integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans at the block, line, word, and point levels. We introduce the Multi-Granular Visual Grounding (MGVG) benchmark, a curated test set of diverse circular notifications from various sectors, each manually annotated with fine-grained, human-verified labels across multiple granularities. Extensive experiments show that our method achieves state-of-the-art grounding accuracy, with line-level granularity providing the best balance between…
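To make the idea of localizing an answer at several granularities concrete, here is a minimal sketch, not the authors' region matching algorithm. It assumes OCR output is already available as lines with pixel boxes and block ids, and it swaps in plain fuzzy string matching from Python's `difflib` as the matcher. The `OcrLine` type, the function names, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x0, y0, x1, y1)


@dataclass
class OcrLine:
    """Hypothetical container for one OCR line."""
    text: str
    box: Box
    block_id: int


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ground_answer(
    answer: str, lines: Sequence[OcrLine], threshold: float = 0.8
) -> List[OcrLine]:
    """Return OCR lines whose text resembles the answer, best match first."""
    scored = [(similarity(answer, ln.text), ln) for ln in lines]
    hits = [(s, ln) for s, ln in scored if s >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [ln for _, ln in hits]


def union_box(boxes: Sequence[Box]) -> Box:
    """Smallest box that encloses all given boxes."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def block_box(line: OcrLine, lines: Sequence[OcrLine]) -> Box:
    """Block-level region: union of all lines sharing the matched line's block."""
    return union_box([ln.box for ln in lines if ln.block_id == line.block_id])


def center_point(box: Box) -> Tuple[float, float]:
    """Point-level region: the center of a box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
```

Given a matched line, `line.box` serves as the line-level region, `block_box` as the block-level region, and `center_point` as the point-level region. Word-level grounding would need per-word boxes from the OCR engine, which this sketch leaves out.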

Code & Models

Repositories

kasuba-badri-vishal/dhrishtikon (Official)

Taxonomy

Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Multimodal Machine Learning Applications