Multimodal Pre-training Based on Graph Attention Network for Document Understanding

Zhenrong Zhang; Jiefeng Ma; Jun Du; Licheng Wang; Jianshu Zhang

arXiv:2203.13530 · cs.CV · October 25, 2022 · 1 cites

TL;DR

GraphDoc is a multimodal graph attention-based model that pre-trains on text, layout, and image data to improve document understanding across diverse formats and layouts.

Contribution

It introduces a novel graph attention mechanism incorporating multimodal features for pre-training on unlabeled documents, enhancing document understanding performance.

Findings

01 Achieves state-of-the-art results on public datasets.

02 Effectively models contextual relationships in documents.

03 Utilizes only 320k unlabeled documents for pre-training.

Abstract

Document intelligence, a relatively new research topic, supports many business applications. Its main task is to automatically read, understand, and analyze documents. However, because documents vary widely in format (invoices, reports, forms, etc.) and layout, it is difficult for machines to understand them. In this paper, we present GraphDoc, a multimodal graph attention-based model for various document understanding tasks. GraphDoc is pre-trained in a multimodal framework that uses text, layout, and image information simultaneously. Because a text block in a document relies heavily on its surrounding context, we inject the graph structure into the attention mechanism to form a graph attention layer in which each input node can attend only to its neighbors. The input nodes of each graph attention layer are composed of textual, visual, and positional…
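
The neighborhood-restricted attention described in the abstract can be illustrated with a small NumPy sketch. This is not the GraphDoc implementation: the k-nearest-neighbor graph built from box centers, the helper names `knn_mask` and `masked_attention`, and the use of identity query/key/value projections are all assumptions made for illustration. It shows only the general idea of masking attention scores so that each node attends to its graph neighbors.

```python
import numpy as np

def knn_mask(centers, k):
    """Boolean (n, n) mask: True where node j is among node i's k nearest
    nodes, plus the node itself. The k-NN rule is an assumption here."""
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    nearest = np.argsort(dist, axis=1)[:, :k + 1]  # usually includes self (distance 0)
    mask = np.zeros(dist.shape, dtype=bool)
    np.put_along_axis(mask, nearest, True, axis=1)
    np.fill_diagonal(mask, True)  # guarantee every node can attend to itself
    return mask

def masked_attention(x, mask):
    """Single-head scaled dot-product attention restricted to `mask`.
    Identity projections stand in for learned Q/K/V weights (a simplification)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # block non-neighbors
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

# Toy input: 6 text blocks with 2-D box centers and 8-D node features.
rng = np.random.default_rng(0)
centers = rng.uniform(0, 1, size=(6, 2))
features = rng.normal(size=(6, 8))

mask = knn_mask(centers, k=2)
out, weights = masked_attention(features, mask)
```

Each row of `weights` sums to 1 and is exactly zero outside the node's neighborhood, so the output for a text block depends only on itself and its nearby blocks.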

Code & Models

Repositories

zzr8066/graphdoc
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Handwritten Text Recognition Techniques · Text and Document Classification Technologies