ROAP: A Reading-Order and Attention-Prior Pipeline for Optimizing Layout Transformers in Key Information Extraction
Tingwei Xie, Jinxin He, Yonghong Song

TL;DR
ROAP is a novel pipeline that enhances layout transformers for document understanding by explicitly modeling reading order and reducing visual noise, leading to improved performance on key information extraction tasks.
Contribution
This paper introduces ROAP, a lightweight, architecture-agnostic pipeline that optimizes attention in Layout Transformers without modifying pre-trained models, by modeling reading order and suppressing visual noise.
Findings
ROAP improves the performance of LayoutLMv3 and GeoLayoutLM on the FUNSD and CORD datasets.
Explicit reading order modeling enhances document understanding accuracy.
Suppressing visual noise refines textual interactions in multimodal transformers.
Abstract
The efficacy of Multimodal Transformers in visually-rich document understanding (VrDU) is critically constrained by two inherent limitations: the lack of explicit modeling for logical reading order and the interference of visual tokens that dilutes attention on textual semantics. To address these challenges, this paper presents ROAP, a lightweight and architecture-agnostic pipeline designed to optimize attention distributions in Layout Transformers without altering their pre-trained backbones. The proposed pipeline first employs an Adaptive-XY-Gap (AXG-Tree) to robustly extract hierarchical reading sequences from complex layouts. These sequences are then integrated into the attention mechanism via a Reading-Order-Aware Relative Position Bias (RO-RPB). Furthermore, a Textual-Token Sub-block Attention Prior (TT-Prior) is introduced to adaptively suppress visual noise and enhance…
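To make the two attention-side ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a reading-order rank per token is already available (for example from a layout tree such as the AXG-Tree), turns signed rank differences into an additive bias in the spirit of RO-RPB, and adds a constant bonus on the text-to-text sub-block of the attention matrix in the spirit of TT-Prior. The function names, the bucket clipping, the lookup `table`, and the `strength` parameter are all hypothetical choices made for illustration; the paper's actual formulation may differ.

```python
import numpy as np


def reading_order_bias(order, num_buckets=8, table=None):
    """Additive bias indexed by signed reading-order distance (hypothetical form).

    order[i] is the rank of token i in the extracted reading sequence.
    table holds one value per clipped distance in [-num_buckets, num_buckets];
    in a trained model it would be learned, here it defaults to zeros.
    """
    order = np.asarray(order)
    rel = order[None, :] - order[:, None]  # rel[i, j] = order[j] - order[i]
    idx = np.clip(rel, -num_buckets, num_buckets) + num_buckets
    if table is None:
        table = np.zeros(2 * num_buckets + 1)
    return np.asarray(table)[idx]


def text_block_prior(is_text, strength=1.0):
    """Constant bonus on text-to-text pairs only (assumed, simplified prior)."""
    is_text = np.asarray(is_text, dtype=bool)
    return strength * np.outer(is_text, is_text).astype(float)


def attention_weights(q, k, bias):
    """Scaled dot-product attention with an additive bias on the logits."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)
```

Because both terms are added to the attention logits before the softmax, a sketch like this leaves the backbone's projection weights untouched, which is consistent with the abstract's claim that the pipeline does not alter the pre-trained backbone.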
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Topic Modeling
