Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
Zhiyang Xu, Chao Feng, Rulin Shao, Trevor Ashby, Ying Shen, Di Jin, Yu, Cheng, Qifan Wang, Lifu Huang

TL;DR
Vision-Flan introduces a highly diverse visual instruction dataset and a two-stage tuning process that significantly improves multi-modal model performance and understanding, addressing key challenges like bias and generalizability.
Contribution
The paper presents the largest diverse visual instruction dataset and a novel two-stage instruction tuning framework that outperforms traditional methods.
Findings
Two-stage tuning outperforms single-stage tuning.
GPT-4 data modulates response formats, not capabilities.
A small amount of GPT-4 data effectively aligns responses.
Abstract
Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within the existing VLM frameworks: (1) lacking task diversity in pretraining and visual instruction tuning, and (2) annotation error and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, and each task is accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are firstly finetuned on Vision-Flan and further tuned on GPT-4 synthesized data. We find this two-stage tuning…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsVisual and Cognitive Learning Processes
MethodsAttention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Dropout · Residual Connection
