LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Yang Xu; Yiheng Xu; Tengchao Lv; Lei Cui; Furu Wei; Guoxin Wang; Yijuan Lu; Dinei Florencio; Cha Zhang; Wanxiang Che; Min Zhang; Lidong Zhou

arXiv:2012.14740 · cs.CL · January 11, 2022 · 59 cites

Open Access · 5 Repos · 3 Models

TL;DR

LayoutLMv2 introduces a multi-modal pre-training framework that effectively models text, layout, and image interactions for visually-rich document understanding, achieving state-of-the-art results across multiple datasets.

Contribution

It proposes a novel multi-modal Transformer architecture with new pre-training tasks and spatial-aware self-attention, enhancing cross-modality understanding in document analysis.
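
As a rough illustration of what a spatial-aware self-attention layer can look like, the sketch below adds relative-position bias terms (1D token offset, plus 2D x and y offsets between text-block coordinates) to ordinary scaled dot-product attention scores. This is a minimal single-head sketch under stated assumptions, not the authors' implementation: the function names, the `(x, y)` box format, the dictionary bias tables, and the clipping of offsets to a fixed range are all hypothetical simplifications chosen for readability.

```python
import math


def clip(offset, limit):
    """Clamp a relative offset to [-limit, limit] so it indexes a finite bias table."""
    return max(-limit, min(limit, offset))


def spatial_aware_attention(q, k, v, boxes, bias_1d, bias_x, bias_y, limit):
    """Single-head attention with additive relative-position biases.

    q, k, v : lists of n vectors (lists of floats); q and k share dimension d
    boxes   : list of n (x, y) integer coordinates per token (assumed format)
    bias_*  : dicts mapping a clipped offset in [-limit, limit] to a float;
              in a trained model these would be learned parameters
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            # Plain scaled dot-product score between query i and key j.
            s = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            # Additive biases: sequence offset, then horizontal and vertical offsets.
            s += bias_1d[clip(j - i, limit)]
            s += bias_x[clip(boxes[j][0] - boxes[i][0], limit)]
            s += bias_y[clip(boxes[j][1] - boxes[i][1], limit)]
            scores.append(s)
        # Numerically stable softmax over the biased scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of value vectors.
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out


def zero_bias(limit):
    """Hypothetical helper: a bias table with every offset set to 0.0."""
    return {o: 0.0 for o in range(-limit, limit + 1)}
```

With all bias tables at zero the function reduces to standard attention; raising the bias for a given offset makes tokens at that relative position attract more attention regardless of their content.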

Findings

01. Outperforms LayoutLM by a large margin.

02. Achieves new state-of-the-art on multiple datasets.

03. Model and code are publicly available.

Abstract

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which make it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment…

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Feature Pyramid Network · Average Pooling · Global Average Pooling · Kaiming Initialization · Convolution · Grouped Convolution · Batch Normalization · ResNeXt Block