mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Anwen Hu; Haiyang Xu; Jiabo Ye; Ming Yan; Liang Zhang; Bo Zhang; Chen Li; Ji Zhang; Qin Jin; Fei Huang; Jingren Zhou

arXiv:2403.12895 · cs.CV · March 20, 2024 · 1 citation

TL;DR

mPLUG-DocOwl 1.5 introduces Unified Structure Learning, which helps multimodal large language models understand text-rich images such as documents, tables, and charts, and achieves state-of-the-art results on 10 benchmarks.

Contribution

The paper proposes Unified Structure Learning, which combines structure-aware parsing tasks with multi-grained text localization tasks, together with the H-Reducer vision-to-text module and two new datasets, DocStruct4M and DocReason25K, to improve multimodal document understanding.

Findings

01. Achieves state-of-the-art results on 10 benchmarks.

02. Improves on prior state-of-the-art MLLMs with a 7B LLM by more than 10 points on 5 of the 10 benchmarks.

03. Introduces two new datasets, DocStruct4M and DocReason25K.

Abstract

Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features…
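
The abstract says H-Reducer keeps layout information while shortening the visual feature sequence. A minimal sketch of that general idea follows, assuming the reduction merges horizontally adjacent patch features within each row. The merge width of 4, the use of plain averaging in place of a learned layer, and the names `merge_horizontal` and `flatten` are all illustrative assumptions, not details taken from the text above.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # rows x columns x feature dimension


def merge_horizontal(grid: Grid, width: int = 4) -> Grid:
    """Average each run of `width` horizontally adjacent feature vectors.

    Assumption: averaging stands in for whatever learned merge the real
    module uses; only the shape behaviour is illustrated here.
    """
    merged: Grid = []
    for row in grid:
        if len(row) % width != 0:
            raise ValueError("row length must be divisible by the merge width")
        new_row: List[Vector] = []
        for start in range(0, len(row), width):
            group = row[start:start + width]
            dim = len(group[0])
            new_row.append([sum(v[d] for v in group) / width for d in range(dim)])
        merged.append(new_row)
    return merged


def flatten(grid: Grid) -> List[Vector]:
    """Row-major flattening, so tokens from the same row stay contiguous."""
    return [vec for row in grid for vec in row]


# Toy 2 x 8 grid of 1-dimensional features (hypothetical data).
grid = [[[float(r * 8 + c)] for c in range(8)] for r in range(2)]
reduced = merge_horizontal(grid, width=4)
tokens = flatten(reduced)
print(len(flatten(grid)), "->", len(tokens))
```

Merging only along the horizontal axis leaves the number of rows unchanged, so the vertical layout of the page is preserved while each row contributes a quarter as many tokens.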

Code & Models

Repositories

x-plug/mplug-docowl (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling

Methods: Sparse Evolutionary Training