LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei

TL;DR
LayoutLMv3 introduces a unified pre-training approach for Document AI that combines text and image masking with cross-modal alignment, achieving state-of-the-art results across diverse document understanding tasks.
Contribution
It proposes a simple, unified architecture and training objectives for multimodal pre-training, enabling effective learning for both text-centric and image-centric Document AI tasks.
Findings
Achieves state-of-the-art performance in form and receipt understanding.
Excels in document visual question answering and layout analysis.
Demonstrates effectiveness across diverse Document AI applications.
Abstract
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3…
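The word-patch alignment objective described in the abstract can be illustrated with a small sketch of how its binary targets might be built. This is not the authors' implementation: the helper names `patch_index` and `wpa_labels`, the 224-pixel image with 16-pixel patches, the choice to locate a word's patch by its bounding-box center, and the choice to skip words whose text is itself masked are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical box format for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


def patch_index(box: Box, image_size: int, patch_size: int) -> int:
    """Return the row-major index of the patch containing the box center."""
    grid = image_size // patch_size  # patches per side
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so a center on the far edge still falls in the last patch.
    col = min(int(cx // patch_size), grid - 1)
    row = min(int(cy // patch_size), grid - 1)
    return row * grid + col


def wpa_labels(
    word_boxes: List[Box],
    text_masked: List[bool],
    patch_masked: List[bool],
    image_size: int = 224,  # assumed input resolution
    patch_size: int = 16,   # assumed patch size
) -> List[Optional[int]]:
    """Build alignment targets: 1 if the word's patch is visible,
    0 if that patch is masked, None if the word is excluded."""
    labels: List[Optional[int]] = []
    for box, word_is_masked in zip(word_boxes, text_masked):
        if word_is_masked:
            # Assumption: masked words do not contribute to this objective.
            labels.append(None)
            continue
        p = patch_index(box, image_size, patch_size)
        labels.append(0 if patch_masked[p] else 1)
    return labels


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 16, 16, 32)]
    text_mask = [False, False, True]
    patch_mask = [False] * (224 // 16) ** 2
    patch_mask[1] = True  # mask the patch under the second word
    print(wpa_labels(boxes, text_mask, patch_mask))  # [1, 0, None]
```

In a training loop, the non-`None` entries would typically feed a binary classification loss over the corresponding text-token representations; how the loss is weighted against the masking objectives is not stated in the text above.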
Code & Models
- 🤗 microsoft/layoutlmv3-base (model) · 677k dl · ♡ 478
- 🤗 microsoft/layoutlmv3-large (model) · 68k dl · ♡ 126
- 🤗 HYPJUDY/layoutlmv3-base-finetuned-funsd (model) · 258 dl · ♡ 5
- 🤗 HYPJUDY/layoutlmv3-large-finetuned-funsd (model) · 39 dl · ♡ 5
- 🤗 HYPJUDY/layoutlmv3-base-finetuned-publaynet (model) · 129 dl · ♡ 45
- 🤗 microsoft/layoutlmv3-base-chinese (model) · 3.2k dl · ♡ 81
- 🤗 jinhybr/OCR-LayoutLMv3 (model) · 36 dl
- 🤗 seckmaster/microsoft-layoutlmv3-large (model)
- 🤗 mkdigitalgmbh/runpo-LayoutLM3-Invoice-Receipt (model) · 3 dl
- 🤗 karanjaWakaba/layoutlmv3-base (model)
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling
