TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types

Jiankang Chen; Tianke Zhang; Changyi Liu; Haojie Ding; Yaya Shi; Feng Cheng; Huihui Xiao; Bin Wen; Fan Yang; Tingting Gao; Di Zhang

arXiv:2502.09925·cs.CV·February 17, 2025

TL;DR

TaskGalaxy introduces a large-scale, automatically generated multimodal instruction dataset with over 19,000 task types, significantly enhancing model generalization and performance across diverse vision-language tasks.

Contribution

The paper presents a novel automated method to expand task diversity in multimodal datasets using GPT-4o, reducing manual effort and improving model performance.

Findings

1. Improved performance on 16 benchmarks with TaskGalaxy-enhanced models.

2. Automated dataset generation increases task diversity and data quality.

3. Demonstrates the effectiveness of large-scale, diverse instruction datasets in multimodal models.

Abstract

Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple…
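
The task-to-image matching step mentioned in the abstract can be pictured as a nearest-neighbour search in a shared embedding space. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: it assumes image and task-type embeddings have already been computed (for example by a CLIP-style encoder), uses toy three-dimensional vectors in their place, and the task-type names, the "/" hierarchy separator, and the function names are all hypothetical. It ranks task types by cosine similarity to one image and keeps the top k as candidates for a later, stricter filter.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_task_types(image_emb, task_embs, k=2):
    """Return the k task-type names whose embeddings best match the image."""
    scored = [(cosine(image_emb, emb), name) for name, emb in task_embs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy stand-ins for precomputed embeddings (hypothetical names and values).
task_embs = {
    "OCR/Scene Text Recognition": [0.9, 0.1, 0.0],
    "Detection/Object Counting": [0.1, 0.9, 0.1],
    "Chart/Trend Analysis": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.3, 0.0]

candidates = top_k_task_types(image_emb, task_embs, k=2)
print(candidates)
```

In a setup like this, the similarity ranking is a cheap first pass; the abstract indicates that GPT-4o is also involved in filtering, which would correspond to a second check applied to the shortlisted candidates.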

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques