UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Jiabo Ye; Anwen Hu; Haiyang Xu; Qinghao Ye; Ming Yan; Guohai Xu; Chenliang Li; Junfeng Tian; Qi Qian; Ji Zhang; Qin Jin; Liang He; Xin Alex Lin; Fei Huang

arXiv:2310.05126·cs.CV·October 10, 2023·5 cites

TL;DR

UReader is a universal, OCR-free multimodal large language model that handles visually-situated language understanding tasks across diverse domains while finetuning only a small fraction of its parameters.

Contribution

UReader is a first exploration of OCR-free universal visually-situated language understanding built on a multimodal large language model, jointly finetuned on multiple tasks while updating few parameters and keeping training cost low.

Findings

01 Achieves state-of-the-art OCR-free performance on 8 out of 10 tasks

02 Requires finetuning only 1.2% of parameters

03 Operates effectively across 5 different visual domains

Abstract

Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than that of previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive cropping module before the encoder-decoder…
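
The 1.2% figure in the abstract is a ratio of finetuned parameters to total parameters. As a rough illustration of how such a ratio is computed, the Python sketch below sums parameter counts over frozen and trainable groups. The group names (`visual_encoder`, `language_model`, `adapter`) and all parameter counts are made-up assumptions for illustration; they are not UReader's actual components or sizes.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    """A hypothetical block of model weights (illustrative only)."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(groups):
    """Return the share of parameters that would be updated during finetuning."""
    total = sum(g.num_params for g in groups)
    if total == 0:
        raise ValueError("model has no parameters")
    trainable = sum(g.num_params for g in groups if g.trainable)
    return trainable / total


# Illustrative counts only: large frozen backbones plus a small trainable module.
groups = [
    ParamGroup("visual_encoder", 300_000_000, trainable=False),
    ParamGroup("language_model", 7_000_000_000, trainable=False),
    ParamGroup("adapter", 88_000_000, trainable=True),
]

fraction = trainable_fraction(groups)
print(f"trainable share: {fraction:.2%}")
```

With counts like these, the frozen backbones dominate the total, so the trainable share stays in the low single-digit percent range even though the trainable module itself has tens of millions of weights.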

Code & Models

Repositories

lukeforeveryoung/ureader
PyTorch · Official

Models

🤗 hanchaow/QTuneVL1_5-2B
model · 2 dl

Datasets

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning