MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding

Zilong Wang; Jiuxiang Gu; Chris Tensmeyer; Nikolaos Barmpalios; Ani Nenkova; Tong Sun; Jingbo Shang; Vlad I. Morariu

arXiv:2211.14958·cs.CV·November 29, 2022

Open Access

TL;DR

MGDoc introduces a multi-granular pre-training framework that effectively encodes and models hierarchical relationships in document images across page, region, and word levels, improving understanding tasks.

Contribution

The paper presents MGDoc, a novel multi-modal, multi-granular pre-training approach with a unified encoder and cross-granular attention for hierarchical document image understanding.
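To make the idea of cross-granular attention concrete, here is a minimal sketch in plain Python. It is generic scaled dot-product attention in which word embeddings attend over region and page embeddings. It is not the paper's architecture, whose details are not given in this summary; the function names, the toy two-dimensional embeddings, and the choice of letting words query the coarser levels are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention (hypothetical helper).

    Each query vector is replaced by a weighted average of the value
    vectors, weighted by its similarity to the matching key vectors.
    """
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        attended.append(
            [
                sum(w * v[i] for w, v in zip(weights, values))
                for i in range(len(values[0]))
            ]
        )
    return attended


# Toy embeddings (assumed values): one page, two regions, three words.
page = [[0.5, 0.5]]
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

# Words query the coarser granularities (regions and page).
coarse = regions + page
words_ctx = cross_attend(words, coarse, coarse)
```

Each row of `words_ctx` is a convex combination of the region and page vectors, so a word that resembles the first region ends up pulled toward it. A real model would use learned projections for queries, keys, and values, and would likely attend in more than one direction.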

Findings

1. Improved performance across multiple document understanding tasks.

2. Effective encoding of hierarchical relationships in document images.

3. Enhanced feature learning for multi-granular information.

Abstract

Document images are a ubiquitous source of data where the text is organized in a complex hierarchical structure ranging from fine granularity (e.g., words) through medium granularity (e.g., regions such as paragraphs or figures) to coarse granularity (e.g., the whole page). The spatial hierarchical relationships between content at different levels of granularity are crucial for document image understanding tasks. Existing methods learn features at either the word level or the region level but fail to consider both simultaneously. Word-level models are restricted by the fact that they originate from pure-text language models, which only encode the word-level context. In contrast, region-level models attempt to encode regions corresponding to paragraphs or text blocks into a single embedding, but they perform worse with additional word-level features. To deal with these issues, we propose MGDoc, a…
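The page, region, and word hierarchy described in the abstract can be pictured as a small tree built from bounding boxes. The sketch below is only an illustration under stated assumptions: the class names, the `build_hierarchy` helper, the containment rule for assigning words to regions, and the coordinates are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Axis-aligned bounding box in page coordinates (assumed layout)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other):
        """True when `other` lies entirely inside this box."""
        return (
            self.x0 <= other.x0
            and self.y0 <= other.y0
            and other.x1 <= self.x1
            and other.y1 <= self.y1
        )


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Region:
    label: str
    box: Box
    words: list = field(default_factory=list)


@dataclass
class Page:
    box: Box
    regions: list = field(default_factory=list)


def build_hierarchy(page_box, region_specs, words):
    """Attach each word to the first region whose box contains it."""
    page = Page(page_box, [Region(label, box) for label, box in region_specs])
    for word in words:
        for region in page.regions:
            if region.box.contains(word.box):
                region.words.append(word)
                break
    return page


# Toy layout (assumed coordinates): a title region and a paragraph region.
page = build_hierarchy(
    Box(0, 0, 100, 100),
    [("title", Box(10, 5, 90, 15)), ("paragraph", Box(10, 20, 90, 60))],
    [
        Word("MGDoc", Box(12, 6, 30, 14)),
        Word("Document", Box(12, 22, 40, 28)),
        Word("images", Box(42, 22, 60, 28)),
    ],
)
```

Strict containment is the simplest possible assignment rule. Real OCR output often has boxes that overlap region borders, so an overlap-ratio threshold would probably be more robust in practice.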

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling
