Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Zhiyang Xu; Chao Feng; Rulin Shao; Trevor Ashby; Ying Shen; Di Jin; Yu Cheng; Qifan Wang; Lifu Huang

arXiv:2402.11690 · cs.CL · February 20, 2024 · 2 cites

TL;DR

Vision-Flan introduces a highly diverse visual instruction dataset and a two-stage tuning process that significantly improves multi-modal model performance and understanding, addressing key challenges like bias and generalizability.

Contribution

The paper presents Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date (187 tasks, 1,664,261 instances), and a two-stage instruction tuning framework that outperforms single-stage tuning.

Findings

01. Two-stage tuning outperforms single-stage tuning.

02. GPT-4 data modulates response formats, not capabilities.

03. A small amount of GPT-4 data effectively aligns responses.

Abstract

Despite vision-language models' (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within the existing VLM frameworks: (1) lacking task diversity in pretraining and visual instruction tuning, and (2) annotation error and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, and each task is accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are firstly finetuned on Vision-Flan and further tuned on GPT-4 synthesized data. We find this two-stage tuning…
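The two-stage schedule described in the abstract can be sketched in a few lines of Python. This is only an illustration of the ordering (human-labeled tasks first, GPT-4 synthesized data second), not the authors' training code: `Example`, `ToyModel`, `finetune`, and `two_stage_tuning` are hypothetical names, and the stand-in model records what it was tuned on instead of running gradient updates.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Example:
    """One instruction-tuning instance (hypothetical schema)."""
    instruction: str  # task instruction, expert-written in stage 1
    image_id: str     # placeholder for the image input
    target: str       # expected response


@dataclass
class ToyModel:
    """Stand-in for a VLM; it only logs the tuning stages it went through."""
    history: List[str] = field(default_factory=list)

    def finetune(self, stage_name: str, data: Sequence[Example]) -> None:
        # A real implementation would run gradient updates on `data` here.
        self.history.append(f"{stage_name}:{len(data)}")


def two_stage_tuning(
    model: ToyModel,
    human_labeled: Sequence[Example],
    gpt4_synthesized: Sequence[Example],
) -> ToyModel:
    """Tune on human-labeled tasks first, then on GPT-4 synthesized data."""
    model.finetune("stage1_human_labeled", human_labeled)
    model.finetune("stage2_gpt4_synthesized", gpt4_synthesized)
    return model
```

The point of the sketch is the ordering: the second call starts from the weights produced by the first, so the GPT-4 synthesized data is applied on top of a model already tuned on the human-labeled tasks.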

Taxonomy

Topics: Visual and Cognitive Learning Processes

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Label Smoothing · Adam · Softmax · Multi-Head Attention · Layer Normalization · Dropout · Residual Connection