TRINS: Towards Multimodal Language Models that Can Read

Ruiyi Zhang; Yanzhe Zhang; Jian Chen; Yufan Zhou; Jiuxiang Gu; Changyou Chen; Tong Sun

arXiv:2406.06730 · cs.CV · June 12, 2024 · 1 cites

TL;DR

TRINS is a new dataset designed to improve multimodal language models' ability to read and understand text within images, addressing a key limitation of existing models by providing longer, more complex annotations.

Contribution

The paper introduces TRINS, a text-rich image instruction dataset, and LaRA (Language-vision Reading Assistant), an architecture intended to improve multimodal models' reading ability.

Findings

01 LaRA outperforms existing models on TRINS and classical benchmarks.

02 The TRINS dataset contains over 39,000 text-rich images with long annotations.

03 Models trained on TRINS show improved understanding of textual content in images.

Abstract

Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily because of limited training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of multimodal large language models. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that the number of words per annotation in TRINS is significantly larger than that of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called a Language-vision Reading Assistant…

Code & Models

Repositories

llavar/llavar-2
pytorch

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems