MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding
Zilong Wang, Jiuxiang Gu, Chris Tensmeyer, Nikolaos Barmpalios, Ani Nenkova, Tong Sun, Jingbo Shang, Vlad I. Morariu

TL;DR
MGDoc introduces a multi-granular pre-training framework that encodes the hierarchical relationships among page-level, region-level, and word-level content in document images, improving performance on document understanding tasks.
Contribution
The paper presents MGDoc, a novel multi-modal, multi-granular pre-training approach with a unified encoder and cross-granular attention for hierarchical document image understanding.
Findings
Improved performance across multiple document understanding tasks.
Effective encoding of hierarchical relationships in document images.
Enhanced feature learning for multi-granular information.
Abstract
Document images are a ubiquitous source of data where the text is organized in a complex hierarchical structure ranging from fine granularity (e.g., words), medium granularity (e.g., regions such as paragraphs or figures), to coarse granularity (e.g., the whole page). The spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks. Existing methods learn features from either word-level or region-level but fail to consider both simultaneously. Word-level models are restricted by the fact that they originate from pure-text language models, which only encode the word-level context. In contrast, region-level models attempt to encode regions corresponding to paragraphs or text blocks into a single embedding, but they perform worse with additional word-level features. To deal with these issues, we propose MGDoc, a…
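To make the three levels of granularity concrete, the sketch below builds a page, region, and word hierarchy from bounding boxes by assigning each word to the first region that fully contains it. It is only an illustration of the spatial hierarchy the abstract describes, not MGDoc's encoder or pre-training procedure. The class names, the `contains` and `build_hierarchy` helpers, and the (x0, y0, x1, y1) box format are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) with the origin at the top-left corner.
Box = Tuple[float, float, float, float]


@dataclass
class Word:
    """Fine granularity: a single word and its bounding box."""
    text: str
    box: Box


@dataclass
class Region:
    """Medium granularity: a paragraph, figure, or other block."""
    label: str
    box: Box
    words: List[Word] = field(default_factory=list)


@dataclass
class Page:
    """Coarse granularity: the whole page."""
    box: Box
    regions: List[Region] = field(default_factory=list)


def contains(outer: Box, inner: Box) -> bool:
    """Return True if `inner` lies entirely within `outer`."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def build_hierarchy(page_box: Box,
                    region_specs: List[Tuple[str, Box]],
                    words: List[Word]) -> Page:
    """Hypothetical helper: attach each word to the first region containing it.

    Words that fall inside no region are left unassigned.
    """
    page = Page(box=page_box,
                regions=[Region(label, box) for label, box in region_specs])
    for word in words:
        for region in page.regions:
            if contains(region.box, word.box):
                region.words.append(word)
                break
    return page


# Example: a page with a title region and a paragraph region.
page = build_hierarchy(
    (0, 0, 100, 100),
    [("title", (10, 5, 90, 15)), ("paragraph", (10, 20, 90, 60))],
    [
        Word("MGDoc", (12, 6, 30, 14)),
        Word("Document", (12, 22, 35, 28)),
        Word("images", (37, 22, 55, 28)),
    ],
)
for region in page.regions:
    print(region.label, [w.text for w in region.words])
```

Strict containment is the simplest possible assignment rule. Real layouts with noisy OCR boxes would more likely need an overlap threshold, which is outside the scope of this sketch.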
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling
