Enhancing Visually-Rich Document Understanding via Layout Structure Modeling
Qiwei Li, Zuchao Li, Xiantao Cai, Bo Du, Hai Zhao

TL;DR
This paper introduces GraphLayoutLM, a new model that incorporates layout structure graphs into document understanding, significantly improving performance by modeling spatial relationships between text elements.
Contribution
The paper presents GraphLayoutLM, a novel approach that integrates layout structure graphs and layout-aware self-attention to enhance visually-rich document understanding.
Findings
Achieves state-of-the-art results on FUNSD, XFUND, and CORD datasets.
Both graph reordering and layout-aware attention are crucial for optimal performance.
Significant improvement over existing models by incorporating layout information.
Abstract
In recent years, the use of multi-modal pre-trained Transformers has led to significant advancements in visually-rich document understanding. However, existing models have mainly focused on features such as text and vision while neglecting the importance of layout relationships between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that models a layout structure graph to inject document layout knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model enables the understanding of the spatial arrangement of text elements, improving document comprehension. We evaluate our model on various benchmarks, including FUNSD, XFUND and…
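The abstract names two mechanisms, graph-based reordering of the text sequence and layout-aware self-attention, without giving their details here. The following is a minimal illustrative sketch, not the authors' implementation. It assumes the layout graph is a parent/child tree over text segments, that reordering is a depth-first traversal with siblings sorted by position, and that layout-aware attention can be approximated by a mask that lets each node attend only to itself and its graph neighbours. The `TextNode` schema and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TextNode:
    """A text segment with a bounding box (x0, y0, x1, y1). Hypothetical schema."""
    node_id: int
    text: str
    box: Tuple[int, int, int, int]
    parent: Optional[int] = None  # assumed parent link in the layout graph


def reorder_by_layout_graph(nodes: List[TextNode]) -> List[TextNode]:
    """Depth-first order over the assumed parent/child tree.

    Siblings are visited top-to-bottom, then left-to-right, so each
    child segment ends up directly after the segment it belongs to.
    """
    children: Dict[Optional[int], List[TextNode]] = {}
    for node in nodes:
        children.setdefault(node.parent, []).append(node)
    for siblings in children.values():
        siblings.sort(key=lambda n: (n.box[1], n.box[0]))

    ordered: List[TextNode] = []
    stack = list(reversed(children.get(None, [])))  # roots have parent None
    while stack:
        node = stack.pop()
        ordered.append(node)
        stack.extend(reversed(children.get(node.node_id, [])))
    return ordered


def graph_attention_mask(ordered: List[TextNode]) -> List[List[bool]]:
    """mask[i][j] is True when node i may attend to node j.

    Assumption: allowed pairs are self, parent, child, and siblings
    (nodes sharing a parent, including root-level nodes).
    """
    return [
        [
            a is b
            or a.parent == b.node_id
            or b.parent == a.node_id
            or a.parent == b.parent
            for b in ordered
        ]
        for a in ordered
    ]


# Two-column form: plain top-to-bottom reading would interleave the columns.
nodes = [
    TextNode(0, "Sender", (10, 10, 90, 20)),
    TextNode(1, "Receiver", (200, 10, 290, 20)),
    TextNode(2, "Alice", (10, 30, 90, 40), parent=0),
    TextNode(3, "Bob", (200, 30, 290, 40), parent=1),
]
ordered = reorder_by_layout_graph(nodes)
print([n.text for n in ordered])  # ['Sender', 'Alice', 'Receiver', 'Bob']
mask = graph_attention_mask(ordered)
```

In a Transformer, such a mask would typically be applied to the attention scores before the softmax. Whether GraphLayoutLM uses a hard mask, a learned bias, or another scheme is not stated in the text above.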
Taxonomy
Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Natural Language Processing Techniques
