CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA

Jiale Dong; Hao Wu; Zihao Wang; Wenqi Lou; Zhendong Zheng; Lei Gong; Chao Wang; Xuehai Zhou

arXiv:2506.08496 · cs.AR · June 11, 2025

TL;DR

This paper presents an FPGA accelerator for quantized Mixture-of-Experts Vision Transformers (MoE-ViTs). It pairs a dual-stage quantization scheme with a resource-aware hardware design to reach high throughput and energy efficiency with less than 1% accuracy loss.

Contribution

The paper proposes a dual-stage quantization scheme and a resource-aware FPGA architecture tailored to MoE-ViTs, enabling deployment with high throughput and low energy consumption.

Findings

01. Achieves a throughput of 155 frames per second (fps) on FPGA.

02. Improves throughput by 5.35× over the state of the art (SOTA).

03. Reduces energy consumption by over 80% with less than 1% accuracy loss.

Abstract

Vision Transformers (ViTs) exhibit superior performance in computer vision tasks but face deployment challenges on resource-constrained devices due to high computational and memory demands. While Mixture-of-Experts Vision Transformers (MoE-ViTs) mitigate this through a scalable architecture with sub-linear computational growth, their hardware implementation on FPGAs remains constrained by resource limitations. This paper proposes a novel accelerator for efficiently implementing quantized MoE models on FPGAs through two key innovations: (1) a dual-stage quantization scheme combining precision-preserving complex quantizers with hardware-friendly simplified quantizers via scale reparameterization, with only 0.28% accuracy loss compared to full precision; (2) a resource-aware accelerator architecture featuring latency-optimized streaming attention kernels and reusable linear operators,…
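
The abstract names scale reparameterization as the bridge between the complex and the simplified quantizers but gives no details. The following is a minimal sketch of one common form of that idea, not the paper's confirmed scheme: per-channel activation scales (the "complex" quantizer) are replaced by a single shared scale (the "simplified" quantizer), and the per-channel ratio is folded into the LayerNorm affine parameters and the rows of the following linear layer so that the floating-point output is unchanged. Every name here (`layer_norm`, `linear`, `reparameterize`, `channel_scales`) is hypothetical, and symmetric quantization without zero points is assumed for brevity.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over one token vector with per-channel affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

def linear(x, weight, bias):
    """y[j] = sum_i x[i] * weight[i][j] + bias[j]."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]

def reparameterize(gamma, beta, weight, channel_scales):
    """Fold per-channel scales into LayerNorm and the next layer's weights.

    Assumed scheme: the shared scale is the mean of the per-channel scales.
    """
    shared = sum(channel_scales) / len(channel_scales)
    ratio = [s / shared for s in channel_scales]
    new_gamma = [g / r for g, r in zip(gamma, ratio)]    # shrink channel c by ratio[c]
    new_beta = [b / r for b, r in zip(beta, ratio)]
    new_weight = [[r * w for w in row] for r, row in zip(ratio, weight)]  # undo it
    return new_gamma, new_beta, new_weight, shared

# Toy data: 4 input channels, 2 output channels (illustrative values only).
x = [0.5, -1.0, 2.0, 0.25]
gamma = [1.0, 0.5, 2.0, 1.5]
beta = [0.0, 0.1, -0.2, 0.3]
weight = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.05, 0.6]]
bias = [0.05, -0.05]
channel_scales = [0.02, 0.08, 0.01, 0.05]  # per-channel (complex) quantizer scales

act = layer_norm(x, gamma, beta)
reference = linear(act, weight, bias)

new_gamma, new_beta, new_weight, shared = reparameterize(
    gamma, beta, weight, channel_scales)
new_act = layer_norm(x, new_gamma, new_beta)
reparam = linear(new_act, new_weight, bias)

# The float output of the LayerNorm + linear pair is preserved.
max_diff = max(abs(a - b) for a, b in zip(reference, reparam))

# Dividing by the shared scale now gives the same pre-rounding values
# that the per-channel scales gave on the original activations.
per_channel = [a / s for a, s in zip(act, channel_scales)]
per_tensor = [a / shared for a in new_act]
code_diff = max(abs(p - q) for p, q in zip(per_channel, per_tensor))
```

In this sketch the hardware only needs one activation scale per tensor at inference time, while the calibration that chose the per-channel scales is kept in the folded parameters. The adjusted weights would still need to be quantized afterwards, a step this example leaves out.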

Code & Models

Repositories

dj000011/coqmoe
PyTorch · Official

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Visual Attention and Saliency Detection