StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Yanda Li; Chi Zhang; Gang Yu; Zhibin Wang; Bin Fu; Guosheng Lin; Chunhua Shen; Ling Chen; Yunchao Wei

arXiv:2308.10253 · cs.CV · December 29, 2023 · 2 cites

TL;DR

StableLLaVA proposes a data collection method for visual instruction tuning that synthesizes images and dialogues together using ChatGPT and text-to-image generative models, improving multimodal model capabilities and reporting state-of-the-art results on multiple benchmarks.

Contribution

The paper presents a new data collection approach that synthesizes images and dialogues, reducing domain bias and enabling scalable, diverse datasets for enhanced multimodal instruction tuning.

Findings

01. Significant improvements in over ten capabilities.
02. State-of-the-art results on multiple benchmarks.
03. Flexible dataset scaling with generative models.

Abstract

The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models…
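The pairing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated here and does not specify prompts, models, or data formats. The names `ImageDialogueSample`, `synthesize_sample`, `build_dataset`, and both stub generators are hypothetical, and the stubs stand in for a chat model and a text-to-image model so that the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A dialogue is a list of (question, answer) turns. This format is an assumption.
Dialogue = List[Tuple[str, str]]


@dataclass
class ImageDialogueSample:
    """One synthesized training pair (hypothetical structure)."""
    image_prompt: str
    image: bytes
    dialogue: Dialogue


def synthesize_sample(
    topic: str,
    text_generator: Callable[[str], Tuple[str, Dialogue]],
    image_generator: Callable[[str], bytes],
) -> ImageDialogueSample:
    # The text model produces the image prompt and the dialogue together,
    # so the dialogue describes the image that will be generated from the prompt.
    image_prompt, dialogue = text_generator(topic)
    image = image_generator(image_prompt)
    return ImageDialogueSample(image_prompt, image, dialogue)


def build_dataset(
    topics: List[str],
    text_generator: Callable[[str], Tuple[str, Dialogue]],
    image_generator: Callable[[str], bytes],
) -> List[ImageDialogueSample]:
    # Scaling the dataset only requires more topics, not new human annotation.
    return [synthesize_sample(t, text_generator, image_generator) for t in topics]


def stub_text_generator(topic: str) -> Tuple[str, Dialogue]:
    # Placeholder for a chat model call (assumption, not a real API).
    prompt = f"a photo of {topic}"
    dialogue = [("What is shown in the image?", f"The image shows {topic}.")]
    return prompt, dialogue


def stub_image_generator(prompt: str) -> bytes:
    # Placeholder for a text-to-image model; returns fake image bytes.
    return prompt.encode("utf-8")


if __name__ == "__main__":
    dataset = build_dataset(
        ["a red bicycle", "a bowl of ramen"],
        stub_text_generator,
        stub_image_generator,
    )
    for sample in dataset:
        print(sample.image_prompt, len(sample.image), sample.dialogue)
```

Passing the generators in as callables keeps the sketch independent of any particular model; real chat and text-to-image backends could be swapped in for the stubs without changing the pairing logic.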

Code & Models

Repositories

icoz69/stablellava
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections