UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA

Jiale Dong; Wenqi Lou; Zhendong Zheng; Yunji Qin; Lei Gong; Chao Wang; Xuehai Zhou

arXiv:2502.05602 · cs.AR · February 18, 2025

TL;DR

UbiMoE is an FPGA-based accelerator designed for Mixture-of-Experts Vision Transformers. It balances performance and resource use through tailored kernels and a heuristic hardware-tuning algorithm, and reports 1.34x and 3.35x throughput improvements on two FPGA platforms.

Contribution

The paper introduces UbiMoE, a novel FPGA accelerator for MoE-ViT that employs specialized kernels and a heuristic search for hardware tuning, achieving superior throughput and energy efficiency.

Findings

01. Achieves 1.34x and 3.35x throughput improvements on two FPGA platforms.

02. Enhances energy efficiency by 1.75x and 1.54x compared to state-of-the-art.

03. Develops a latency-optimized streaming attention kernel and a resource-efficient linear kernel.

Abstract

Compared to traditional Vision Transformers (ViT), Mixture-of-Experts Vision Transformers (MoE-ViT) are introduced to scale model size without a proportional increase in computational complexity, making them a new research focus. Given the high performance and reconfigurability, FPGA-based accelerators for MoE-ViT emerge, delivering substantial gains over general-purpose processors. However, existing accelerators often fall short of fully exploring the design space, leading to suboptimal trade-offs between resource utilization and performance. To overcome this problem, we introduce UbiMoE, a novel end-to-end FPGA accelerator tailored for MoE-ViT. Leveraging the unique computational and memory access patterns of MoE-ViTs, we develop a latency-optimized streaming attention kernel and a resource-efficient reusable linear kernel, effectively balancing performance and resource consumption.…
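
The abstract's opening claim, that MoE-ViT scales model size without a proportional increase in computation, can be illustrated with a minimal sketch of generic top-k expert routing. This is not the paper's kernel or hardware design: the function names `top_k_gate` and `moe_ffn_cost`, the layer dimensions, and the choice of two active experts per token are all assumptions made for illustration.

```python
import math


def top_k_gate(logits, k):
    """Pick the k largest router logits and softmax-normalize them.

    Returns a list of (expert_index, weight) pairs. Generic top-k gating,
    assumed here for illustration; the paper's router may differ.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_ffn_cost(d_model, d_hidden, num_experts, k):
    """Parameter count and per-token multiply-accumulates of one MoE FFN layer.

    Each expert is assumed to be a two-layer MLP; biases are ignored.
    """
    per_expert = 2 * d_model * d_hidden  # up-projection + down-projection
    router = d_model * num_experts       # one logit per expert
    params = num_experts * per_expert + router
    macs_per_token = k * per_expert + router  # only k experts run per token
    return params, macs_per_token


# Hypothetical dimensions, not taken from the paper.
small = moe_ffn_cost(d_model=192, d_hidden=768, num_experts=8, k=2)
large = moe_ffn_cost(d_model=192, d_hidden=768, num_experts=16, k=2)
gate = top_k_gate([0.1, 2.0, -1.0, 2.0], k=2)
```

Under these assumptions, doubling the expert count roughly doubles the parameter count, while per-token compute grows only by the extra router logits, because each token still activates the same number of experts.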

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Image Processing Techniques and Applications · Advanced Memory and Neural Computing