LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding
Yi Tu, Ya Guo, Huan Chen, Jinyang Tang

TL;DR
LayoutMask is a novel multi-modal pre-training model that enhances text-layout interaction in visually-rich document understanding, achieving state-of-the-art results across various document analysis tasks.
Contribution
It introduces a new layout input method and two pre-training objectives to improve text-layout fusion in document understanding models.
Findings
Achieves state-of-the-art performance on form understanding.
Improves robustness in receipt understanding.
Enhances document image classification accuracy.
Abstract
Visually-rich Document Understanding (VrDU) has attracted much research attention over the past years. Pre-trained models on a large number of document images with transformer-based backbones have led to significant performance gains in this field. The major challenge is how to fusion the different modalities (text, layout, and image) of the documents in a unified model with different pre-training tasks. This paper focuses on improving text-layout interactions and proposes a novel multi-modal pre-training model, LayoutMask. LayoutMask uses local 1D position, instead of global 1D position, as layout input and has two pre-training objectives: (1) Masked Language Modeling: predicting masked tokens with two novel masking strategies; (2) Masked Position Modeling: predicting masked 2D positions to improve layout representation learning. LayoutMask can enhance the interactions between text and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHandwritten Text Recognition Techniques · Music and Audio Processing · Multimodal Machine Learning Applications
