A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation

Shijie Zhou; Ruiyi Zhang; Yufan Zhou; Changyou Chen

arXiv:2412.16364·cs.CV·December 24, 2024

TL;DR

This paper introduces LLaVAR-2, a dataset of 424k instruction pairs for text-rich image instruction tuning, built by combining human-written image captions with GPT-4o generation. Models fine-tuned on it outperform those trained on self-instruct data.

Contribution

It presents a novel hybrid instruction generation method combining human annotations and GPT-4o to produce a large, high-quality dataset for training multimodal models.

Findings

01 Models fine-tuned on LLaVAR-2 outperform those trained on self-instruct data.

02 The dataset contains 424,000 high-quality instruction-image pairs.

03 Enhanced multimodal alignment improves model understanding of text-rich images.

Abstract

Large multimodal models still struggle with text-rich images because of inadequate training data. Self-Instruct provides an annotation-free way for generating instruction data, but its quality is poor, as multimodal alignment remains a hurdle even for the largest models. In this work, we propose LLaVAR-2, to enhance multimodal alignment for text-rich images through hybrid instruction generation between human annotators and large language models. Specifically, it involves detailed image captions from human annotators, followed by the use of these annotations in tailored text prompts for GPT-4o to curate a dataset. It also implements several mechanisms to filter out low-quality data, and the resulting dataset comprises 424k high-quality pairs of instructions. Empirical results show that models fine-tuned on this dataset exhibit impressive enhancements over those trained with self-instruct…
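
The pipeline the abstract describes (human caption, then a caption-based prompt for the language model, then filtering) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_prompt`, `generate_pairs`, and `filter_pairs`, the prompt wording, and the `llm` callable are all hypothetical, and the filter shown (dropping empty and duplicate pairs) is only a placeholder for the filtering mechanisms the paper mentions but the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class InstructionPair:
    image_id: str
    question: str
    answer: str


def build_prompt(caption: str) -> str:
    """Embed a human-written caption in a text prompt (wording is illustrative)."""
    return (
        "Here is a human-written description of a text-rich image:\n"
        f"{caption}\n"
        "Write question-answer pairs that can be answered from the image."
    )


def generate_pairs(
    image_id: str,
    caption: str,
    llm: Callable[[str], List[Tuple[str, str]]],
) -> List[InstructionPair]:
    """Ask a language model for (question, answer) pairs grounded in the caption.

    `llm` is a hypothetical callable standing in for a model such as GPT-4o;
    no real API is called here.
    """
    prompt = build_prompt(caption)
    return [InstructionPair(image_id, q.strip(), a.strip()) for q, a in llm(prompt)]


def filter_pairs(pairs: List[InstructionPair]) -> List[InstructionPair]:
    """Placeholder quality filter: drop empty and duplicate pairs.

    This is an assumed heuristic, not the filtering used in the paper.
    """
    seen = set()
    kept = []
    for pair in pairs:
        if not pair.question or not pair.answer:
            continue
        key = (pair.image_id, pair.question.lower(), pair.answer.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept
```

Passing the model in as a callable keeps the sketch testable with a stub that returns fixed pairs, so the prompt construction and filtering can be checked without any network access.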

Code & Models

Repositories

llavar/llavar-2 (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques