Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li

TL;DR
This paper introduces an adaptive markup language generation approach for visual document understanding, leveraging structured representations to improve comprehension and reasoning in complex visual scenarios.
Contribution
It presents a novel pipeline using adaptive markup generation and introduces two large datasets for training and fine-tuning models in this domain.
Findings
The proposed model outperforms state-of-the-art MLLMs on visual document benchmarks.
The structured datasets enable better spatial and contextual understanding.
The approach reduces hallucinations and improves reasoning over visual documents.
Abstract
Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M…
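To make the idea of adaptive markup generation concrete, the following is a minimal sketch of choosing a markup language per document type and serializing content into it. It is not the authors' pipeline: the mapping in `MARKUP_FOR_DOC_TYPE`, the document-type names, and the helper functions are all hypothetical and chosen only for illustration. In the paper the choice is made by the model itself, not by a lookup table.

```python
import json

# Hypothetical pairing of document types with markup languages.
# These pairings are illustrative assumptions, not taken from the paper.
MARKUP_FOR_DOC_TYPE = {
    "text_document": "markdown",
    "chart": "json",
    "webpage": "html",
    "scientific_figure": "tikz",
}


def choose_markup(doc_type: str) -> str:
    """Return the markup language for a document type (Markdown by default)."""
    return MARKUP_FOR_DOC_TYPE.get(doc_type, "markdown")


def render_table_as_markdown(header, rows):
    """Serialize a table into Markdown so row/column structure stays explicit."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def render_chart_as_json(title, series):
    """Serialize chart data into JSON so values can be looked up by key."""
    return json.dumps({"title": title, "series": series}, sort_keys=True)


if __name__ == "__main__":
    print(choose_markup("chart"))
    print(render_table_as_markdown(["Year", "Sales"], [[2023, 10], [2024, 14]]))
    print(render_chart_as_json("Sales", {"2023": 10, "2024": 14}))
```

The point of the sketch is that a structured intermediate form exposes layout and values directly, so a downstream answer can be grounded in explicit fields rather than in free-form text.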
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Data Visualization and Analytics
