M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design
Hanxue Liang, Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen, Kai Zou, Yu Cheng, Cong Hao, Zhangyang Wang

TL;DR
M$^3$ViT introduces a mixture-of-experts vision transformer framework that enables efficient multi-task learning by activating only relevant experts during training and inference, reducing resource use and improving accuracy.
Contribution
The paper presents a novel MoE-based vision transformer design with hardware-aware optimizations for resource-efficient multi-task learning on edge devices.
Findings
Achieves 88% reduction in inference FLOPs.
Reduces memory requirements by 2.4 times on FPGA.
Improves energy efficiency by up to 9.23 times.
Abstract
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to execute just a single task. Yet most real systems demand only one or two tasks at each moment, and switch between tasks as needed; such all-tasks-activated inference is therefore highly inefficient and non-scalable. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL. Our framework, dubbed M$^3$ViT, customizes mixture-of-experts (MoE) layers into a vision…
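The sparse activation described in the abstract can be illustrated with a minimal sketch of top-k expert routing. This is not the authors' implementation: the class `TaskMoELayer`, the use of one router per task, the single-matrix experts, and all sizes are assumptions made for illustration only. The point it shows is that, for a given task, only `top_k` of the experts are evaluated, so the compute per token does not grow with the total number of experts.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TaskMoELayer:
    """Hypothetical MoE layer with one router per task (an assumption)."""

    def __init__(self, dim, num_experts, num_tasks, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Each expert is a single dim x dim linear map here; real experts
        # would typically be small MLPs.
        self.experts = [
            [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]
        # One num_experts x dim routing matrix per task.
        self.routers = [
            [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]
            for _ in range(num_tasks)
        ]
        self.expert_calls = 0  # counts how many experts were actually run

    def forward(self, token, task_id):
        # Score every expert with the router of the active task.
        logits = matvec(self.routers[task_id], token)
        # Keep only the top_k highest-scoring experts.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        gates = softmax([logits[i] for i in chosen])
        # Evaluate only the chosen experts and mix their outputs.
        out = [0.0] * len(token)
        for gate, idx in zip(gates, chosen):
            self.expert_calls += 1
            expert_out = matvec(self.experts[idx], token)
            out = [o + gate * v for o, v in zip(out, expert_out)]
        return out, chosen


layer = TaskMoELayer(dim=4, num_experts=8, num_tasks=2, top_k=2)
output, chosen = layer.forward([1.0, 0.5, -0.5, 2.0], task_id=0)
```

In this sketch, a single forward pass runs 2 of the 8 experts. The remaining experts stay idle, which is the property the abstract contrasts with dense MTL models that activate nearly the entire network for a single task.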
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Layer Normalization · Residual Connection · Dense Connections · Vision Transformer
