XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding

Zhangxuan Gu; Changhua Meng; Ke Wang; Jun Lan; Weiqiang Wang; Ming Gu; Liqing Zhang

arXiv:2203.06947 · cs.CV · March 16, 2022 · 1 cites

TL;DR

XYLayoutLM is a novel multimodal network for visually-rich document understanding that effectively captures layout information from OCR-processed reading orders and utilizes a new position encoding to improve performance.

Contribution

The paper introduces XYLayoutLM, which leverages proper reading orders and a Dilated Conditional Position Encoding to enhance layout-aware multimodal document understanding.
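To make the idea of a convolution-generated position encoding concrete, here is a minimal sketch, not the paper's module. It assumes a single shared 1-D kernel with fixed weights (a learned, per-channel kernel would be used in practice), zero padding at the sequence ends, and a hypothetical function name `dilated_position_encoding`. The point it illustrates is that position vectors computed from each token's dilated neighbourhood work for any sequence length, unlike a fixed-size lookup table.

```python
from typing import List

Vector = List[float]


def dilated_position_encoding(
    tokens: List[Vector], kernel: List[float], dilation: int = 2
) -> List[Vector]:
    """Return one position vector per token via a dilated 1-D convolution.

    Assumption: one kernel shared by all channels, zero padding outside
    the sequence. This is an illustration, not the paper's exact module.
    """
    n = len(tokens)
    dim = len(tokens[0]) if n else 0
    half = len(kernel) // 2
    positions = []
    for i in range(n):
        vec = [0.0] * dim
        for k, weight in enumerate(kernel):
            j = i + (k - half) * dilation  # neighbour index, `dilation` apart
            if 0 <= j < n:  # positions outside the sequence contribute zero
                for d in range(dim):
                    vec[d] += weight * tokens[j][d]
        positions.append(vec)
    return positions


# Toy 1-dimensional token embeddings; the same call works for any length.
tokens = [[1.0], [2.0], [3.0], [4.0], [5.0]]
positions = dilated_position_encoding(tokens, kernel=[1.0, 0.0, 1.0], dilation=2)

# The position vectors are added to the token embeddings before the encoder.
encoded = [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, positions)]
```

Because each position vector depends only on nearby tokens, a shorter or longer input needs no change to the kernel.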

Findings

01. Achieves competitive results on document understanding tasks.

02. Effectively captures layout information from OCR reading orders.

03. Introduces a Dilated Conditional Position Encoding module.

Abstract

Recently, various multimodal networks for Visually-Rich Document Understanding (VRDU) have been proposed, showing that transformers benefit from integrating visual and layout information with the text embeddings. However, most existing approaches use position embeddings to incorporate the sequence information, neglecting the noisy, improper reading order obtained by OCR tools. In this paper, we propose a robust layout-aware multimodal network named XYLayoutLM to capture and leverage rich layout information from proper reading orders produced by our Augmented XY Cut. Moreover, a Dilated Conditional Position Encoding module is proposed to deal with input sequences of variable lengths, and it additionally extracts local layout information from both textual and visual modalities while generating position embeddings. Experimental results show that our XYLayoutLM achieves competitive…
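The reading-order problem can be illustrated with a small sketch of the classic recursive XY cut, which is the baseline the name "Augmented XY Cut" builds on; the augmentation itself is not described on this page and is not reproduced here. The box format `(x0, y0, x1, y1)` with a top-left origin, the function names, and the toy layout are all assumptions made for illustration. The sketch splits the page at blank horizontal gaps first, then at blank vertical gaps, and recurses into each region.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def _split(indices: List[int], boxes: List[Box], lo: int, hi: int) -> List[List[int]]:
    """Group boxes that are separated by blank gaps along one axis."""
    order = sorted(indices, key=lambda i: boxes[i][lo])
    groups, current = [], [order[0]]
    reach = boxes[order[0]][hi]  # furthest extent covered by the current group
    for i in order[1:]:
        if boxes[i][lo] >= reach:  # blank gap found: close the group
            groups.append(current)
            current = [i]
        else:
            current.append(i)
        reach = max(reach, boxes[i][hi])
    groups.append(current)
    return groups


def xy_cut(boxes: List[Box], indices: Optional[List[int]] = None) -> List[int]:
    """Return box indices in reading order using a plain recursive XY cut."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    # Try cutting along y (rows) first, then along x (columns).
    for lo, hi in ((1, 3), (0, 2)):
        groups = _split(indices, boxes, lo, hi)
        if len(groups) > 1:
            return [i for group in groups for i in xy_cut(boxes, group)]
    # No blank gap on either axis: fall back to top-to-bottom, left-to-right.
    return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))


# Toy page: a header above two columns whose paragraphs are vertically staggered.
boxes = [
    (55, 38, 95, 60),  # 0: right column, paragraph 2
    (0, 20, 45, 40),   # 1: left column, paragraph 1
    (0, 0, 100, 10),   # 2: header
    (55, 20, 95, 35),  # 3: right column, paragraph 1
    (0, 45, 45, 60),   # 4: left column, paragraph 2
]
order = xy_cut(boxes)  # [2, 1, 4, 3, 0]: header, left column, right column
naive = sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))
# naive is [2, 1, 3, 0, 4], which interleaves the two columns
```

On this toy layout, a simple top-to-bottom sort mixes the two columns, while the recursive cut reads the left column before the right one.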

Code & Models

Repositories

littletomatodonkey/Augment-XY-CUT

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization