LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Yi Tu; Ya Guo; Huan Chen; Jinyang Tang

arXiv:2305.18721 · cs.CV · June 12, 2023 · 2 cites

Open Access

TL;DR

LayoutMask is a novel multi-modal pre-training model that enhances text-layout interaction in visually-rich document understanding, achieving state-of-the-art results across various document analysis tasks.

Contribution

It introduces a new layout input method and two pre-training objectives to improve text-layout fusion in document understanding models.

Findings

01 Achieves state-of-the-art performance on form understanding.

02 Improves robustness in receipt understanding.

03 Enhances document image classification accuracy.

Abstract

Visually-rich Document Understanding (VrDU) has attracted much research attention over the past years. Pre-trained models on a large number of document images with transformer-based backbones have led to significant performance gains in this field. The major challenge is how to fuse the different modalities (text, layout, and image) of the documents in a unified model with different pre-training tasks. This paper focuses on improving text-layout interactions and proposes a novel multi-modal pre-training model, LayoutMask. LayoutMask uses local 1D position, instead of global 1D position, as layout input and has two pre-training objectives: (1) Masked Language Modeling: predicting masked tokens with two novel masking strategies; (2) Masked Position Modeling: predicting masked 2D positions to improve layout representation learning. LayoutMask can enhance the interactions between text and…
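
The contrast between global and local 1D position mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "global" means a single running index over every token in the document and that "local" means the index restarts inside each text segment (for example, each OCR line or box). The function name `position_ids`, the 1-based start, and the sample segments are hypothetical and chosen for illustration only; they are not taken from the paper's code.

```python
from typing import List, Tuple


def position_ids(segments: List[List[str]]) -> Tuple[List[int], List[int]]:
    """Return (global_ids, local_ids) for tokens grouped into segments.

    Assumption: global ids count tokens across the whole document,
    while local ids restart at 1 at the beginning of every segment.
    """
    global_ids: List[int] = []
    local_ids: List[int] = []
    next_global = 1
    for segment in segments:
        for local_index, _token in enumerate(segment, start=1):
            global_ids.append(next_global)  # running index over the document
            local_ids.append(local_index)   # index within the current segment
            next_global += 1
    return global_ids, local_ids


# Hypothetical document made of two text segments.
segments = [["Invoice", "No", "1234"], ["Total", "Due"]]
global_ids, local_ids = position_ids(segments)
print(global_ids)  # [1, 2, 3, 4, 5]
print(local_ids)   # [1, 2, 3, 1, 2]
```

Under this reading, local ids carry no information about the order of segments, so a model would have to infer cross-segment reading order from the 2D positions and the text itself.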

Taxonomy

Topics: Handwritten Text Recognition Techniques · Music and Audio Processing · Multimodal Machine Learning Applications