TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
Jiankang Chen, Tianke Zhang, Changyi Liu, Haojie Ding, Yaya Shi, Feng Cheng, Huihui Xiao, Bin Wen, Fan Yang, Tingting Gao, Di Zhang

TL;DR
TaskGalaxy introduces a large-scale, automatically generated multimodal instruction dataset with over 19,000 task types, significantly enhancing model generalization and performance across diverse vision-language tasks.
Contribution
The paper presents a novel automated method to expand task diversity in multimodal datasets using GPT-4o, reducing manual effort and improving model performance.
Findings
Models fine-tuned with TaskGalaxy improve on 16 benchmarks.
Automated dataset generation increases task diversity and data quality.
Large-scale, diverse instruction datasets are shown to be effective for multimodal models.
Abstract
Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple…
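The abstract's matching step, in which candidate task types are paired with the images they best fit, can be illustrated with a minimal sketch. It assumes that image and task-name embeddings have already been produced by a CLIP-style encoder; the function names, the toy 3-dimensional vectors, the task names, and the top-k cutoff are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def top_k_tasks(image_emb, task_embs, task_names, k=2):
    """Return the k (task name, cosine similarity) pairs closest to one image.

    Hypothetical helper: the embeddings stand in for encoder outputs, and the
    top-k rule is an assumed shortlist criterion.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float)[None, :])
    tasks = l2_normalize(np.asarray(task_embs, dtype=float))
    sims = (tasks @ img.T).ravel()      # one similarity score per task type
    order = np.argsort(-sims)[:k]       # indices of the k highest scores
    return [(task_names[i], float(sims[i])) for i in order]


# Toy vectors standing in for real encoder outputs (hypothetical values).
task_names = ["text recognition", "object counting", "chart trend analysis"]
task_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.4, 0.1]

candidates = top_k_tasks(image_emb, task_embs, task_names, k=2)
for name, score in candidates:
    print(f"{name}: {score:.3f}")
```

In a pipeline like the one the abstract describes, a shortlist of this kind would then be passed to GPT-4o for a second screening before question-answer pairs are generated. Using embedding similarity first keeps the number of model calls per image small.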
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
