LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Yanzhe Zhang; Ruiyi Zhang; Jiuxiang Gu; Yufan Zhou; Nedim Lipka; Diyi Yang; Tong Sun

arXiv:2306.17107·cs.CV·February 6, 2024·27 cites

TL;DR

LLaVAR enhances visual instruction tuning by incorporating text-rich images and GPT-4 generated data, significantly improving text-based visual question answering and reasoning capabilities in multimodal models.

Contribution

This work introduces a new dataset and training pipeline for text-rich images, leading to a multimodal model with superior understanding of textual details within images.

Findings

01. Up to 20% accuracy improvement on text-based VQA datasets
02. Achieved 91.42% accuracy on ScienceQA
03. Demonstrated improved reasoning and writing skills in multimodal interactions

Abstract

Instruction tuning unlocks the superior capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters and book covers). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA…
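
The data-generation step described in the abstract (feeding recognized text and an image caption to a text-only model) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `TextRichImage` container, the `build_prompt` helper, the instruction wording, and the example poster are hypothetical and are not taken from the paper or its released code. The sketch only assembles the prompt string; it does not call any OCR tool or GPT-4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextRichImage:
    """Hypothetical container for one image's text-side inputs."""
    caption: str          # image caption, e.g. from the source dataset
    ocr_lines: List[str]  # strings returned by some OCR tool


def build_prompt(image: TextRichImage, max_ocr_lines: int = 50) -> str:
    """Assemble a text-only prompt from a caption and OCR results.

    The instruction wording is an illustrative assumption, not the
    prompt used by the authors.
    """
    # Drop empty OCR strings and cap the count so the prompt stays short.
    lines = [s.strip() for s in image.ocr_lines if s.strip()][:max_ocr_lines]
    ocr_block = "\n".join(f"- {s}" for s in lines)
    return (
        "You are given the caption and OCR text of an image you cannot see.\n"
        f"Caption: {image.caption.strip()}\n"
        f"OCR text:\n{ocr_block}\n"
        "Write a conversation of question-answer pairs about the text in the image."
    )


# Made-up example input; the blank OCR string is filtered out.
example = TextRichImage(
    caption="Poster for a science fiction film",
    ocr_lines=["THE LAST ORBIT", "  ", "In theaters July 2024"],
)
prompt = build_prompt(example)
print(prompt)
```

The resulting string would be sent to a text-only language model, and the returned question-answer pairs would then be paired with the original image as instruction-following data.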

Code & Models

Models

SALT-NLP/LLaVAR_delta
model · 11 dl · ♡ 16

Datasets

SALT-NLP/LLaVAR
dataset · 85 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Absolute Position Encodings · Adam · Byte Pair Encoding · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer · Residual Connection