TRINS: Towards Multimodal Language Models that Can Read
Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun

TL;DR
TRINS is a new dataset designed to improve multimodal language models' ability to read and understand text within images, addressing a key limitation of existing models by providing longer, more complex annotations.
Contribution
The paper introduces TRINS, a large text-rich image instruction dataset, and LaRA (Language-vision Reading Assistant), an architecture designed to improve multimodal models' ability to read text in images.
Findings
LaRA outperforms existing models on TRINS and classical benchmarks.
The TRINS dataset contains 39,153 text-rich images with captions and 102,437 questions, and its annotations are longer than those of related datasets.
Models trained on TRINS show improved understanding of textual content in images.
Abstract
Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to the limitation of training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of the multimodal large language model. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that the number of words per annotation in TRINS is significantly higher than that of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called a Language-vision Reading Assistant…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and Dialogue Systems
