XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding
Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, Liqing Zhang

TL;DR
XYLayoutLM is a novel multimodal network for visually-rich document understanding that effectively captures layout information from OCR-processed reading orders and utilizes a new position encoding to improve performance.
Contribution
The paper introduces XYLayoutLM, which leverages proper reading orders and a Dilated Conditional Position Encoding to enhance layout-aware multimodal document understanding.
Findings
Achieves competitive results on document understanding tasks.
Effectively captures layout information from OCR reading orders.
Introduces a Dilated Conditional Position Encoding module.
Abstract
Recently, various multimodal networks for Visually-Rich Document Understanding (VRDU) have been proposed, showing the promotion of transformers by integrating visual and layout information with the text embeddings. However, most existing approaches utilize the position embeddings to incorporate the sequence information, neglecting the noisy improper reading order obtained by OCR tools. In this paper, we propose a robust layout-aware multimodal network named XYLayoutLM to capture and leverage rich layout information from proper reading orders produced by our Augmented XY Cut. Moreover, a Dilated Conditional Position Encoding module is proposed to deal with the input sequence of variable lengths, and it additionally extracts local layout information from both textual and visual modalities while generating position embeddings. Experiment results show that our XYLayoutLM achieves competitive…
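To make the reading-order idea concrete, here is a minimal sketch of the classic recursive XY cut that the name Augmented XY Cut builds on. It is not the paper's augmented variant, whose details are not given on this page. The function names (`split_by_gaps`, `xy_cut`), the box format `(x0, y0, x1, y1)` with a top-left origin, and the rule of trying horizontal cuts before vertical ones are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1), origin at the top-left corner.
Box = Tuple[int, int, int, int]


def split_by_gaps(indices: Sequence[int], boxes: Sequence[Box], axis: str) -> List[List[int]]:
    """Group boxes whose projections on `axis` overlap.

    Consecutive groups are separated by a whitespace gap (touching counts as a gap).
    """
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]  # furthest coordinate covered so far
    for i in ordered[1:]:
        if boxes[i][lo] >= reach:
            groups.append([i])  # gap found: start a new group
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][hi])
    return groups


def xy_cut(boxes: Sequence[Box], indices: Optional[Sequence[int]] = None) -> List[int]:
    """Return box indices in reading order using a plain recursive XY cut."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    # Try horizontal cuts (gaps along y) first, then vertical cuts (gaps along x).
    for axis in ("y", "x"):
        groups = split_by_gaps(indices, boxes, axis)
        if len(groups) > 1:
            order: List[int] = []
            for group in groups:
                order.extend(xy_cut(boxes, group))
            return order
    # No cut separates the boxes: fall back to top-to-bottom, left-to-right.
    return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))


if __name__ == "__main__":
    # A header above two columns, listed in a scrambled "OCR" order.
    page = [
        (300, 110, 500, 150),  # 0: right column, first block
        (0, 0, 500, 40),       # 1: header
        (0, 140, 200, 170),    # 2: left column, second block
        (0, 100, 200, 130),    # 3: left column, first block
        (300, 155, 500, 185),  # 4: right column, second block
    ]
    print(xy_cut(page))
```

In this toy page the header is cut off first, the remaining blocks are split into a left and a right column, and each column is then read top to bottom. Token indices in this order could then replace the raw OCR order when assigning sequence positions.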
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization
