M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

Hanxue Liang; Zhiwen Fan; Rishov Sarkar; Ziyu Jiang; Tianlong Chen; Kai Zou; Yu Cheng; Cong Hao; Zhangyang Wang

arXiv:2210.14793·cs.CV·October 27, 2022·33 cites

TL;DR

M$^3$ViT builds mixture-of-experts layers into a vision transformer so that multi-task learning activates only the experts relevant to the current task, during both training and inference, which reduces compute and memory use while improving accuracy.

Contribution

The paper presents a novel MoE-based vision transformer design with hardware-aware optimizations for resource-efficient multi-task learning on edge devices.

Findings

01 Cuts inference FLOPs by 88% when executing a single task.

02 Reduces the memory requirement by 2.4× in the FPGA implementation.

03 Achieves up to 9.23× higher energy efficiency than a comparable FPGA baseline.

Abstract

Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to execute just a single task. Yet most real systems demand only one or two tasks at each moment and switch between tasks as needed; such "all tasks activated" inference is therefore highly inefficient and non-scalable. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL. Our framework, dubbed M$^3$ViT, customizes mixture-of-experts (MoE) layers into a vision…
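The idea of activating only the experts relevant to one task can be illustrated with a toy sketch. This is not the authors' implementation: the class name `TinyTaskMoE`, the per-task router, the top-k gating rule, the linear experts, and the task names are all assumptions made for illustration, since the excerpt above does not specify how routing is conditioned on the task. It uses only the Python standard library.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyTaskMoE:
    """Toy MoE layer (hypothetical): a shared expert pool, one router per task."""

    def __init__(self, dim, num_experts, tasks, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Each expert is a dim x dim linear map, a stand-in for an MLP expert.
        self.experts = [
            [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]
        # Assumption: each task owns a router of shape num_experts x dim.
        self.routers = {
            task: [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                   for _ in range(num_experts)]
            for task in tasks
        }

    def forward(self, token, task):
        """Route one token through the top-k experts chosen by the task's router."""
        logits = matvec(self.routers[task], token)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[: self.top_k]
        gates = softmax([logits[i] for i in chosen])
        out = [0.0] * len(token)
        # Only the chosen experts are evaluated; the rest stay idle.
        for gate, idx in zip(gates, chosen):
            expert_out = matvec(self.experts[idx], token)
            out = [o + gate * v for o, v in zip(out, expert_out)]
        return out, chosen


layer = TinyTaskMoE(dim=4, num_experts=8, tasks=["segmentation", "depth"])
output, used = layer.forward([1.0, 0.5, -0.5, 2.0], task="depth")
```

In this sketch a forward pass for one task evaluates 2 of the 8 experts, so per-token expert compute scales with `top_k` rather than with the size of the expert pool. The figures reported for M$^3$ViT come from the paper's own design and hardware, not from this toy.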

Code & Models

Repositories

vita-group/m3vit
PyTorch · Official

Videos

M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Layer Normalization · Residual Connection · Dense Connections · Vision Transformer