Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Han Xiao; Yina Xie; Guanxin Tan; Yinghao Chen; Rui Hu; Ke Wang; Aojun Zhou; Hao Li; Hao Shao; Xudong Lu; Peng Gao; Yafei Wen; Xiaoxin Chen; Shuai Ren; Hongsheng Li

arXiv:2505.05446 · cs.CV · May 9, 2025

TL;DR

This paper introduces an adaptive markup language generation approach for visual document understanding, leveraging structured representations to improve comprehension and reasoning in complex visual scenarios.

Contribution

It presents a novel pipeline using adaptive markup generation and introduces two large datasets for training and fine-tuning models in this domain.

Findings

01. The model outperforms state-of-the-art MLLMs on visual document benchmarks.

02. Structured datasets enable better spatial and contextual understanding.

03. The approach reduces hallucinations and improves reasoning over visual documents.

Abstract

Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M…
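To make the idea of adaptive markup generation concrete, here is a minimal sketch of how one small piece of extracted content could be rendered in different markup languages depending on the document type. It is an illustration only, not the authors' pipeline: the `MARKUP_FOR_TYPE` mapping, the `render` function, and the renderer helpers are hypothetical names, and the choice of which markup suits which document type is an assumption made here.

```python
import json
from html import escape

# Hypothetical mapping from document type to markup language.
# This is an assumption for illustration, not the mapping used in the paper.
MARKUP_FOR_TYPE = {
    "table": "markdown",
    "chart": "json",
    "webpage": "html",
}


def to_markdown(header, rows):
    """Render tabular content as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


def to_json(header, rows):
    """Render the same content as a JSON list of records."""
    return json.dumps([dict(zip(header, row)) for row in rows])


def to_html(header, rows):
    """Render the same content as an HTML table, escaping cell text."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(c))}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


RENDERERS = {"markdown": to_markdown, "json": to_json, "html": to_html}


def render(doc_type, header, rows):
    """Pick a markup language for the document type and render the content."""
    markup = MARKUP_FOR_TYPE[doc_type]
    return markup, RENDERERS[markup](header, rows)
```

Calling `render("chart", ["year", "sales"], [[2023, 10], [2024, 12]])` would select JSON under this assumed mapping, while the same content tagged as `"table"` would come back as a Markdown table. In the paper the structured representation is generated by a model from the document image; the lookup table here only stands in for that decision.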

Code & Models

Repositories

Euphoria16/DocMark (PyTorch, official)

Models

HanXiao1999/DocMark-Pretrain-2B (Hugging Face model, 3 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Data Visualization and Analytics