Multimodal Pre-training Based on Graph Attention Network for Document Understanding
Zhenrong Zhang, Jiefeng Ma, Jun Du, Licheng Wang, Jianshu Zhang

TL;DR
GraphDoc is a multimodal graph attention-based model that pre-trains on text, layout, and image data to improve document understanding across diverse formats and layouts.
Contribution
It introduces a novel graph attention mechanism incorporating multimodal features for pre-training on unlabeled documents, enhancing document understanding performance.
Findings
Achieves state-of-the-art results on public datasets.
Effectively models contextual relationships in documents.
Pre-trains on only 320k unlabeled documents.
Abstract
Document intelligence, a relatively new research topic, supports many business applications. Its main task is to automatically read, understand, and analyze documents. However, the diversity of document formats (invoices, reports, forms, etc.) and layouts makes it difficult for machines to understand documents. In this paper, we present GraphDoc, a multimodal graph attention-based model for various document understanding tasks. GraphDoc is pre-trained in a multimodal framework that uses text, layout, and image information simultaneously. Because a text block in a document relies heavily on its surrounding context, we inject the graph structure into the attention mechanism to form a graph attention layer, so that each input node can attend only to its neighbors. The input nodes of each graph attention layer are composed of textual, visual, and positional…
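The idea of restricting attention to a node's neighborhood can be illustrated with a minimal sketch. This is not the authors' implementation: the k-nearest-neighbor rule on box centers, the function names `knn_neighbors` and `graph_attention`, and the use of the raw features as queries, keys, and values (no learned projections, single head) are all assumptions made for illustration.

```python
import math

def knn_neighbors(centers, k):
    """For each text-block center, return the indices of its k nearest
    other blocks plus itself (hypothetical neighborhood rule)."""
    neighbors = []
    for i, (xi, yi) in enumerate(centers):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        neighbors.append({j for _, j in dists[:k]} | {i})
    return neighbors

def graph_attention(features, neighbors):
    """Scaled dot-product attention where node i only attends to neighbors[i].
    Assumption: queries, keys, and values are the raw node features."""
    dim = len(features[0])
    outputs, weights = [], []
    for i, query in enumerate(features):
        idx = sorted(neighbors[i])
        scores = [
            sum(a * b for a, b in zip(query, features[j])) / math.sqrt(dim)
            for j in idx
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        outputs.append(
            [sum(wj * features[j][d] for wj, j in zip(w, idx)) for d in range(dim)]
        )
        weights.append(dict(zip(idx, w)))
    return outputs, weights

# Toy example: two pairs of text blocks that sit far apart on the page.
centers = [(0, 0), (1, 0), (10, 0), (11, 0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
neighbors = knn_neighbors(centers, k=1)
outputs, weights = graph_attention(features, neighbors)
```

In this toy layout, block 0 receives attention weight only from itself and block 1, so distant blocks 2 and 3 contribute nothing to its output. A full model would add learned projections, multiple heads, and the fused textual, visual, and positional node features the abstract mentions.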
Taxonomy
Topics: Topic Modeling · Handwritten Text Recognition Techniques · Text and Document Classification Technologies
