LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models

Ruiyi Zhang; Yufan Zhou; Jian Chen; Jiuxiang Gu; Changyou Chen; Tong Sun

arXiv:2407.19185·cs.CV·July 30, 2024

TL;DR

LLaVA-Read improves multimodal language models' ability to understand complex textual content within images by integrating specialized visual text encoders, surpassing previous models in text-rich image comprehension tasks.

Contribution

The paper introduces LLaVA-Read, a multimodal model with dual visual encoders and a visual text encoder, addressing limitations of classical visual encoders in textual content understanding.

Findings

01. Outperforms existing models in text-rich image understanding tasks
02. Classical visual encoders have limitations in visual text comprehension
03. Visual text understanding remains a key challenge for multimodal models

Abstract

Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle to comprehend dense textual content embedded within images, primarily due to limited text recognition and layout understanding abilities. To understand the sources of these limitations, we perform an exploratory analysis showing the drawbacks of classical visual encoders on visual text understanding. Hence, we present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models in various text-rich image understanding tasks, showcasing enhanced comprehension of textual content within images. Together, our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification