An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training
Zitian Chen, Mingyu Ding, Yikang Shen, Wei Zhan, Masayoshi Tomizuka, Erik Learned-Miller, Chuang Gan

TL;DR
This paper introduces a scalable, multi-task vision transformer model trained on heterogeneous datasets, achieving high performance across diverse tasks with modularity for efficient adaptation and continual learning.
Contribution
It proposes a modified mixture-of-experts vision transformer capable of multi-task learning on diverse datasets, addressing heterogeneity challenges and enabling efficient downstream task adaptation.
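As a rough illustration of what a mixture-of-experts layer shared across tasks can look like, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `TaskRoutedMoE`, the choice of one router per task, the linear experts, and the top-k softmax gating are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TaskRoutedMoE:
    """Hypothetical top-k mixture-of-experts layer with one router per task.

    Experts are shared across tasks; only the routing differs per task.
    This is an assumed design for illustration, not the paper's exact layer.
    """

    def __init__(self, dim, num_experts, num_tasks, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Each expert is a single linear map dim -> dim (a real expert would be an MLP).
        self.experts = rng.normal(0.0, 0.02, size=(num_experts, dim, dim))
        # One routing matrix per task: dim -> num_experts logits.
        self.routers = rng.normal(0.0, 0.02, size=(num_tasks, dim, num_experts))

    def forward(self, tokens, task_id):
        # tokens: (num_tokens, dim)
        logits = tokens @ self.routers[task_id]                   # (num_tokens, num_experts)
        top_idx = np.argsort(logits, axis=-1)[:, -self.top_k:]    # indices of the k largest logits
        top_logits = np.take_along_axis(logits, top_idx, axis=-1)
        gates = softmax(top_logits, axis=-1)                      # renormalize over selected experts
        out = np.zeros_like(tokens)
        for slot in range(self.top_k):
            for e in np.unique(top_idx[:, slot]):
                mask = top_idx[:, slot] == e
                # Weighted contribution of expert e to the tokens routed to it.
                out[mask] += gates[mask, slot][:, None] * (tokens[mask] @ self.experts[e])
        return out, top_idx


# Usage: route five 8-dimensional tokens for task 0 through 4 shared experts.
layer = TaskRoutedMoE(dim=8, num_experts=4, num_tasks=2, top_k=2)
tokens = np.random.default_rng(1).normal(size=(5, 8))
out, chosen = layer.forward(tokens, task_id=0)
```

Because each token touches only `top_k` experts, compute per token stays roughly constant as experts are added. In a design like this one, the set of experts a task actually selects could be kept while the rest are dropped when adapting to a downstream task.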
Findings
Achieves comparable results to single-task models on multiple vision tasks.
Demonstrates strong generalization and modularity for downstream applications.
Enables efficient fine-tuning with fewer parameters and less computation.
Abstract
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Despite considerable progress in multi-task learning, most efforts focus on learning from multi-label data: a single image set with multiple task labels. Such multi-label data sets are rare, small, and expensive. We say heterogeneous to refer to image sets with different task labels, or to combinations of single-task datasets. Few have explored training on such heterogeneous datasets. General-purpose vision models are still dominated by single-task pretraining, and it remains unclear how to scale up multi-task models by leveraging mainstream vision datasets designed for different purposes. The challenges lie in managing large intrinsic differences among vision tasks, including data distribution, architectures, task-specific modules, dataset scales, and sampling strategies.…
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI · Multimodal Machine Learning Applications
MethodsFocus
