Knowing Where and What: Unified Word Block Pretraining for Document Understanding
Song Tao, Zijian Wang, Tiantian Fan, Canjie Luo, Can Huang

TL;DR
UTel is a unified pre-training model for document understanding that effectively combines text and layout information, outperforming previous methods without relying on image modalities.
Contribution
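The paper introduces UTel, a language model with unified text and layout pre-training. It adds two pre-training tasks, Surrounding Word Prediction (SWP) and Contrastive learning of Word Embeddings (CWE), and replaces the usual 1D position embedding with a 1D clipped relative position embedding.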
Findings
UTel outperforms previous methods on downstream tasks.
It effectively models semantic and spatial features jointly.
UTel can process arbitrary-length sequences because it replaces the commonly used 1D position embedding with a 1D clipped relative position embedding.
Abstract
Due to the complex layouts of documents, it is challenging to extract information from documents. Most previous studies develop multimodal pre-trained models in a self-supervised way. In this paper, we focus on the embedding learning of word blocks containing text and layout information, and propose UTel, a language model with Unified TExt and Layout pre-training. Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks. Moreover, we replace the commonly used 1D position embedding with a 1D clipped relative position embedding. In this way, the joint training of Masked Layout-Language Modeling (MLLM) and two newly proposed tasks enables the interaction between semantic and spatial features in a unified way. Additionally, the proposed UTel can process…
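As a rough illustration of why a clipped relative position embedding does not tie a model to a maximum sequence length, the sketch below builds the index matrix that would select rows from a relative position embedding table. It follows the general clipped relative position idea and is not the authors' implementation: the function name and the max_distance parameter are hypothetical, and the abstract does not state the clipping distance UTel uses.

```python
def clipped_relative_positions(seq_len, max_distance):
    """Return a seq_len x seq_len matrix of bucket indices in [0, 2 * max_distance].

    Entry [i][j] encodes the offset (j - i), clipped to
    [-max_distance, max_distance] and shifted to be non-negative so it can
    index an embedding table. max_distance is an assumed hyperparameter.
    """
    buckets = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            offset = j - i
            # Offsets beyond the clipping distance share one bucket.
            clipped = max(-max_distance, min(max_distance, offset))
            row.append(clipped + max_distance)
        buckets.append(row)
    return buckets


# The embedding table needs only 2 * max_distance + 1 rows,
# regardless of how long the input sequence is.
table_size = 2 * 2 + 1

short = clipped_relative_positions(4, max_distance=2)
long_seq = clipped_relative_positions(1000, max_distance=2)

# Largest bucket index used by the 1000-token sequence.
largest_index = max(max(row) for row in long_seq)
```

Because every index stays below table_size, the same small table serves a 4-token and a 1000-token sequence, whereas an absolute 1D position embedding needs one row per position up to a fixed maximum length.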
Taxonomy
Topics: Handwritten Text Recognition Techniques · Music and Audio Processing · Natural Language Processing Techniques
Methods: Contrastive Learning
