A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

Jinghui Lu; Haiyang Yu; Yanjie Wang; Yongjie Ye; Jingqun Tang; Ziwei Yang; Binghong Wu; Qi Liu; Hao Feng; Han Wang; Hao Liu; Can Huang

arXiv:2407.01976·cs.CL·May 20, 2025

TL;DR

LayTextLLM projects each OCR bounding box to a single embedding and interleaves it with text in a large language model. This avoids long input sequences, leverages the model's autoregressive traits, and improves document understanding on key information extraction (KIE) and visual question answering (VQA).

Contribution

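The paper proposes LayTextLLM, which represents each bounding box as a single embedding interleaved with the text embeddings. This addresses two limitations of earlier ways of combining spatial layout with text in LLMs: overly long input sequences and underuse of the LLM's autoregressive traits.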

Findings

01. 15.2% improvement on KIE tasks
02. 10.7% improvement on VQA tasks
03. Efficiently combines layout and text embeddings

Abstract

Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues while leveraging autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in KIE and VQA. Comprehensive benchmark evaluations reveal significant improvements of…
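The interleaving idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `make_box_projector`, `make_token_embedder`, and `interleave` are hypothetical, the projector is a random linear map standing in for a learned one, words stand in for LLM tokens, and boxes are assumed to be normalized `(x1, y1, x2, y2)` coordinates. It only shows the shape of the input sequence: one embedding per bounding box, followed by the embeddings of the text inside that box.

```python
import random

DIM = 8  # toy embedding width (assumption); a real LLM would use its hidden size


def make_box_projector(dim, seed=0):
    """Return a function mapping a 4-number box to one dim-length vector.

    Stand-in for a learned layout projector; the weights here are random.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(dim)]

    def project(box):
        # Plain linear map: one output value per weight row.
        return [sum(w * c for w, c in zip(row, box)) for row in weights]

    return project


def make_token_embedder(dim, seed=1):
    """Return a toy lookup that gives each distinct token a fixed random vector."""
    rng = random.Random(seed)
    table = {}

    def embed(token):
        if token not in table:
            table[token] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        return table[token]

    return embed


def interleave(segments, dim=DIM):
    """Build [box_1, text_1..., box_2, text_2..., ...] as one embedding sequence."""
    project = make_box_projector(dim)
    embed = make_token_embedder(dim)
    sequence, kinds = [], []
    for box, tokens in segments:
        sequence.append(project(box))  # the whole box costs a single position
        kinds.append("box")
        for token in tokens:
            sequence.append(embed(token))
            kinds.append("text")
    return sequence, kinds


# Hypothetical OCR output: (normalized box, words inside that box).
segments = [
    ((0.10, 0.05, 0.40, 0.08), ["Invoice", "No."]),
    ((0.45, 0.05, 0.60, 0.08), ["12345"]),
    ((0.10, 0.90, 0.30, 0.93), ["Total", "due"]),
]

sequence, kinds = interleave(segments)
print(kinds)
print(len(sequence))  # 8 = 3 box positions + 5 text positions
```

Because each box occupies exactly one position, the sequence grows by one entry per OCR segment, whereas writing the four coordinates out as text would add several tokens per segment.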

Code & Models

Repositories

laytextllm/laytextllm (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling