ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

Qiming Peng; Yinxu Pan; Wenjin Wang; Bin Luo; Zhenyu Zhang; Zhengjie Huang; Teng Hu; Weichong Yin; Yongfeng Chen; Yin Zhang; Shikun Feng; Yu Sun; Hao Tian; Hua Wu; Haifeng Wang

arXiv:2210.06155 · cs.CL · October 17, 2022 · 6 cites

TL;DR

ERNIE-Layout introduces a layout-aware pre-training approach that effectively integrates text, layout, and image features, significantly improving performance on various visually-rich document understanding tasks.

Contribution

It presents a novel pre-training framework with layout knowledge enhancement, including reading order prediction and spatial-aware attention, for better document representations.

Findings

01. Achieves state-of-the-art results on key information extraction.
02. Sets new benchmarks in document image classification.
03. Improves document question answering performance.

Abstract

Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental…
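
To make the reading order prediction task named in the abstract more concrete, here is a minimal sketch of one way such an objective could be written. It is not the authors' implementation. It assumes the task is framed as: for each token, score every token as a candidate "next token in reading order", then apply cross-entropy against the true successor. The function name `reading_order_loss`, the `scores` matrix, and the `-1` convention for "no successor" are all hypothetical choices made for this illustration.

```python
import numpy as np

def reading_order_loss(scores, next_index):
    """Cross-entropy over 'which token comes next' (illustrative assumption).

    scores:     (n, n) array; scores[i, j] is an unnormalized score that
                token j directly follows token i in reading order.
    next_index: (n,) integer array; next_index[i] is the true successor of
                token i, or -1 if token i has no successor (hypothetical
                convention for the final token).
    """
    scores = np.asarray(scores, dtype=float)
    next_index = np.asarray(next_index)

    # Numerically stable row-wise log-softmax.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Only tokens that actually have a successor contribute to the loss.
    valid = next_index >= 0
    rows = np.arange(len(next_index))[valid]
    return float(-log_probs[rows, next_index[valid]].mean())


# Three tokens whose correct order is 0 -> 1 -> 2; token 2 is last.
targets = np.array([1, 2, -1])
uninformed = reading_order_loss(np.zeros((3, 3)), targets)

confident_scores = np.zeros((3, 3))
confident_scores[0, 1] = 20.0
confident_scores[1, 2] = 20.0
confident = reading_order_loss(confident_scores, targets)
```

With all-zero scores every candidate is equally likely, so the loss equals log(3); when the scores concentrate on the true successors, the loss approaches zero. In a real model the score matrix would come from the transformer itself rather than being supplied by hand.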

Code & Models

Models

Norm/ERNIE-Layout-Pytorch
model · 876 downloads · 16 likes

Taxonomy

Topics: Handwritten Text Recognition Techniques · Text and Document Classification Technologies · Multimodal Machine Learning Applications