LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou

TL;DR
LayoutLMv2 introduces a multi-modal pre-training framework that effectively models text, layout, and image interactions for visually-rich document understanding, achieving state-of-the-art results across multiple datasets.
Contribution
It proposes a novel multi-modal Transformer architecture with new pre-training tasks and spatial-aware self-attention, enhancing cross-modality understanding in document analysis.
Findings
Outperforms LayoutLM by a large margin.
Achieves new state-of-the-art on multiple datasets.
Model and code are publicly available.
Abstract
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also the new text-image alignment and text-image matching tasks, which help it better capture the cross-modality interaction in the pre-training stage. Meanwhile, it also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can fully understand the relative positional relationship among different text blocks. Experiment…
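The spatial-aware self-attention mentioned in the abstract can be pictured as ordinary scaled dot-product attention scores with extra bias terms indexed by the relative position of two tokens, both in reading order and on the page. The following is a minimal single-head sketch of that idea, not the authors' implementation. The function names, the clipping of offsets to a fixed range, the use of a single (x, y) corner per token, and the toy bias values are all assumptions made for illustration. In a real model the bias tables would be learned.

```python
import math


def clip(offset, limit):
    """Clamp a relative offset into [-limit, limit] so it indexes a finite bias table."""
    return max(-limit, min(limit, offset))


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_aware_attention(q, k, boxes, bias_1d, bias_x, bias_y, limit):
    """Return attention weights for one head (hypothetical helper).

    q, k    : lists of query/key vectors, one per token
    boxes   : list of (x, y) coordinates per token (assumed layout input)
    bias_*  : dicts mapping a clipped relative offset to a scalar bias
    limit   : maximum absolute offset kept before clipping (an assumption)
    """
    n = len(q)
    dim = len(q[0])
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            # Standard scaled dot-product score.
            score = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            # Bias from the relative position in the token sequence.
            score += bias_1d[clip(j - i, limit)]
            # Biases from the relative horizontal and vertical position on the page.
            score += bias_x[clip(boxes[j][0] - boxes[i][0], limit)]
            score += bias_y[clip(boxes[j][1] - boxes[i][1], limit)]
            row.append(score)
        weights.append(softmax(row))
    return weights


# Toy example: three tokens with identical content vectors, so any
# difference in attention comes only from the layout bias.
limit = 2
zero = {o: 0.0 for o in range(-limit, limit + 1)}
same_column = {o: (1.0 if o == 0 else 0.0) for o in range(-limit, limit + 1)}
q = k = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
boxes = [(0, 0), (0, 1), (5, 0)]  # tokens 0 and 1 share an x coordinate
weights = spatial_aware_attention(q, k, boxes, zero, same_column, zero, limit)
```

In this toy setup the first token attends more strongly to the second token, which sits in the same column, than to the third token, even though all three have the same content vector. That is the effect the bias terms are meant to provide: layout proximity can shift attention independently of the text.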
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Feature Pyramid Network · Average Pooling · Global Average Pooling · Kaiming Initialization · Convolution · Grouped Convolution · Batch Normalization · ResNeXt Block
