StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin,, Chunhua Shen, Ling Chen, Yunchao Wei

TL;DR
StableLLaVA introduces a novel method for visual instruction tuning by synthesizing diverse image-dialogue datasets using generative models, significantly improving multimodal model capabilities and achieving state-of-the-art results.
Contribution
The paper presents a new data collection approach that synthesizes images and dialogues, reducing domain bias and enabling scalable, diverse datasets for enhanced multimodal instruction tuning.
Findings
Significant improvements in over ten capabilities.
State-of-the-art results on multiple benchmarks.
Flexible dataset scaling with generative models.
Abstract
The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections
