DOGR: Towards Versatile Visual Document Grounding and Referring

Yinan Zhou, Yuxin Chen, Haokun Lin, Yichen Wu, Shuyu Yang, Zhongang Qi, Chen Ma, Li Zhu, and Ying Shan

arXiv:2411.17125 · cs.CV · August 7, 2025

TL;DR

This paper introduces DOGR-Engine, a data engine that generates fine-grained data for visual document grounding and referring, and DOGR-Bench, a benchmark built with it, to strengthen multimodal large language models' detailed document understanding.

Contribution

The paper presents the DOGR-Engine for generating high-quality, fine-grained document data, constructs DOGR-Bench for comprehensive evaluation, and develops the DOGR baseline model for document grounding and referring.

Findings

01. DOGR-Engine generates multi-granular parsing data and instruction-tuning data.

02. DOGR-Bench covers seven grounding and referring tasks across three document types (chart, poster, and PDF document).

03. The DOGR model outperforms existing methods in fine-grained document understanding.

Abstract

With recent advances in Multimodal Large Language Models (MLLMs), grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. However, these capabilities remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and Referring data engine (DOGR-Engine), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs' grounding and referring capabilities in dialogue and reasoning. Using the DOGR-Engine, we construct DOGR-Bench, a benchmark covering seven grounding and referring tasks across three document types (chart, poster, and PDF document),…

Taxonomy

Topics: Video Analysis and Summarization · Semantic Web and Ontologies · Natural Language Processing Techniques