LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun

TL;DR
LLaVAR enhances visual instruction tuning by incorporating text-rich images and GPT-4 generated data, significantly improving text-based visual question answering and reasoning capabilities in multimodal models.
Contribution
This work introduces a new dataset and training pipeline for text-rich images, leading to a multimodal model with superior understanding of textual details within images.
Findings
Up to 20% accuracy improvement on text-based VQA datasets
Achieved 91.42% accuracy on ScienceQA
Demonstrated improved reasoning and writing skills in multimodal interactions
Abstract
Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA…
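The data-generation step described in the abstract (feeding recognized text and an image caption to a text-only model) can be illustrated with a minimal sketch. The record type `TextRichImage`, the function `build_conversation_prompt`, the `max_ocr_items` cap, and the prompt wording are all hypothetical and chosen for illustration; they are not the authors' actual prompt or code, and no model or OCR tool is called here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRichImage:
    """Hypothetical record: a caption plus the strings an OCR tool recognized."""
    caption: str
    ocr_texts: List[str]


def build_conversation_prompt(image: TextRichImage, max_ocr_items: int = 50) -> str:
    """Assemble a text-only prompt from a caption and OCR output.

    The instruction wording below is an assumption, not the paper's prompt.
    """
    # Drop empty OCR fragments and cap the count so the prompt stays short.
    fragments = [t.strip() for t in image.ocr_texts if t.strip()][:max_ocr_items]
    ocr_block = "\n".join(f"- {t}" for t in fragments) or "- (no text recognized)"
    return (
        "You are given a caption and OCR results for an image you cannot see.\n"
        f"Caption: {image.caption}\n"
        "OCR results:\n"
        f"{ocr_block}\n"
        "Write question-answer pairs about the text in the image."
    )


if __name__ == "__main__":
    poster = TextRichImage(
        caption="A movie poster",
        ocr_texts=["COMING SOON", "  ", "July 4"],
    )
    print(build_conversation_prompt(poster))
```

The resulting string would be sent to a text-only language model, and its reply parsed into question-answer pairs. Because the model never sees the image, the quality of the generated conversation depends on how faithful the OCR output and caption are.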
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Absolute Position Encodings · Adam · Byte Pair Encoding · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer · Residual Connection
