LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
Ruiyi Zhang, Yufan Zhou, Jian Chen, Jiuxiang Gu, Changyou Chen, Tong Sun

TL;DR
LLaVA-Read improves multimodal language models' ability to understand dense textual content within images by adding a visual text encoder alongside dual visual encoders, and it surpasses previous models on text-rich image comprehension tasks.
Contribution
The paper introduces LLaVA-Read, a multimodal model with dual visual encoders and a visual text encoder, addressing limitations of classical visual encoders in textual content understanding.
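As a rough illustration of the "dual visual encoders plus a visual text encoder" idea, the sketch below runs three encoders on one image and joins their outputs into a single token sequence for the language model. This summary does not say how the paper combines the encoder outputs, so plain concatenation is an assumption, and the encoder names, stub outputs, and `project` helper are all hypothetical stand-ins rather than the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[object], List[Vector]]


def project(tokens: Sequence[Vector], dim: int) -> List[Vector]:
    """Pad or truncate each vector to `dim` (stand-in for a learned projection)."""
    return [list(t[:dim]) + [0.0] * (dim - len(t[:dim])) for t in tokens]


def build_visual_prefix(image, encoders: Sequence[Encoder], dim: int) -> List[Vector]:
    """Run every encoder on the image and concatenate the projected tokens.

    Assumption: simple concatenation; the paper may merge features differently.
    """
    prefix: List[Vector] = []
    for encode in encoders:
        prefix.extend(project(encode(image), dim))
    return prefix


# Hypothetical stub encoders that return fixed dummy features.
def general_encoder(image):
    return [[1.0, 1.0] for _ in range(2)]        # 2 tokens, width 2


def high_res_encoder(image):
    return [[2.0, 2.0, 2.0] for _ in range(3)]   # 3 tokens, width 3


def visual_text_encoder(image):
    return [[3.0] for _ in range(4)]             # 4 tokens, width 1


prefix = build_visual_prefix(
    None, [general_encoder, high_res_encoder, visual_text_encoder], dim=3
)
```

In this toy setup every encoder's tokens are brought to a common width before concatenation, so the language model would see one uniform sequence regardless of which encoder produced each token.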
Findings
Outperforms existing models in text-rich image understanding tasks
Classical visual encoders have limitations in visual text comprehension
Visual text understanding remains a key challenge for multimodal models
Abstract
Large multimodal language models have demonstrated impressive capabilities in understanding and manipulating images. However, many of these models struggle with comprehending intensive textual contents embedded within the images, primarily due to the limited text recognition and layout understanding ability. To understand the sources of these limitations, we perform an exploratory analysis showing the drawbacks of classical visual encoders on visual text understanding. Hence, we present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder. Our model surpasses existing state-of-the-art models in various text-rich image understanding tasks, showcasing enhanced comprehension of textual content within images. Together, our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
