DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding

Wenhui Liao; Jiapeng Wang; Hongliang Li; Chengyu Wang; Jun Huang; Lianwen Jin

arXiv:2408.15045 · cs.CV · March 20, 2025

TL;DR

DocLayLLM is an efficient multi-modal extension of large language models designed for text-rich document understanding, integrating visual and positional information to improve performance with lightweight training.

Contribution

It introduces a novel lightweight multi-modal extension of LLMs with chain-of-thought techniques for improved document understanding.

Findings

01. Outperforms existing OCR-dependent methods.

02. Achieves high performance with lightweight training.

03. Effectively integrates visual and positional tokens.

Abstract

Text-rich document understanding (TDU) requires comprehensive analysis of documents containing substantial textual content and complex layouts. While Multimodal Large Language Models (MLLMs) have achieved fast progress in this domain, existing approaches either demand significant computational resources or struggle with effective multi-modal integration. In this paper, we introduce DocLayLLM, an efficient multi-modal extension of LLMs specifically designed for TDU. By lightly integrating visual patch tokens and 2D positional tokens into LLMs' input and encoding the document content using the LLMs themselves, we fully take advantage of the document comprehension capability of LLMs and enhance their perception of OCR information. We have also deeply considered the role of chain-of-thought (CoT) and innovatively proposed the techniques of CoT Pre-training and CoT Annealing. Our DocLayLLM…
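The abstract's idea of adding visual patch tokens and 2D positional tokens to the LLM input can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the token ordering (patches first, then one box token before each OCR span's text tokens), the 0..1000 box normalization, and the random projection weights are all assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import random

HIDDEN = 8  # toy hidden size (assumption); real LLMs use thousands of dimensions


def make_weight(d_in, d_out, rng):
    """Create a small random d_in x d_out projection matrix (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]


def linear(vec, weight):
    """Project a length-d_in vector through a d_in x d_out matrix."""
    d_out = len(weight[0])
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(d_out)]


def normalize_box(box, width, height, scale=1000):
    """Map a pixel box (x0, y0, x1, y1) onto a resolution-independent 0..scale grid."""
    x0, y0, x1, y1 = box
    return [x0 * scale / width, y0 * scale / height,
            x1 * scale / width, y1 * scale / height]


def build_input(patches, ocr_items, text_embed, w_patch, w_box, page_size):
    """Return (kind, vector) tokens: visual patches, then a box token plus text tokens per OCR span."""
    width, height = page_size
    seq = [("patch", linear(p, w_patch)) for p in patches]
    for box, token_ids in ocr_items:
        seq.append(("box", linear(normalize_box(box, width, height), w_box)))
        seq.extend(("text", text_embed[t]) for t in token_ids)
    return seq


# Hypothetical toy inputs: 4 image patches of 6 features, 2 OCR spans on a 200x400 page.
rng = random.Random(0)
patches = [[rng.random() for _ in range(6)] for _ in range(4)]
ocr_items = [((50, 100, 150, 200), [0, 1]), ((10, 300, 190, 340), [2, 3, 4])]
text_embed = [[rng.random() for _ in range(HIDDEN)] for _ in range(5)]
w_patch = make_weight(6, HIDDEN, rng)
w_box = make_weight(4, HIDDEN, rng)

sequence = build_input(patches, ocr_items, text_embed, w_patch, w_box, (200, 400))
kinds = [kind for kind, _ in sequence]
```

In this sketch every token, whatever its modality, ends up as a vector of the same hidden size, so the sequence could in principle be fed to a language model as ordinary input embeddings. Only the two small projections are new parameters here, which is one way to read the abstract's "lightweight" framing.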

Code & Models

Repositories

whlscut/DocLayLLM (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Handwritten Text Recognition Techniques