CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA
Jiale Dong, Hao Wu, Zihao Wang, Wenqi Lou, Zhendong Zheng, Lei Gong, Chao Wang, Xuehai Zhou

TL;DR
This paper introduces an FPGA accelerator for quantized Mixture-of-Experts Vision Transformers (MoE-ViTs). It pairs a dual-stage quantization scheme with a resource-aware architecture, reporting 155 fps throughput and over 80% energy reduction with under 1% accuracy loss.
Contribution
It proposes a dual-stage quantization scheme and a resource-aware FPGA architecture tailored for MoE-ViTs, enabling efficient deployment with high performance and low energy consumption.
Findings
Achieves 155 fps throughput on FPGA.
5.35× throughput improvement over SOTA.
Over 80% energy reduction with <1% accuracy loss.
Abstract
Vision Transformers (ViTs) exhibit superior performance in computer vision tasks but face deployment challenges on resource-constrained devices due to high computational/memory demands. While Mixture-of-Experts Vision Transformers (MoE-ViTs) mitigate this through a scalable architecture with sub-linear computational growth, their hardware implementation on FPGAs remains constrained by resource limitations. This paper proposes a novel accelerator for efficiently implementing quantized MoE models on FPGAs through two key innovations: (1) A dual-stage quantization scheme combining precision-preserving complex quantizers with hardware-friendly simplified quantizers via scale reparameterization, with only 0.28 accuracy loss compared to full precision; (2) A resource-aware accelerator architecture featuring latency-optimized streaming attention kernels and reusable linear operators,…
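The abstract does not spell out how scale reparameterization links the two quantizer stages, so the following is only a minimal sketch of one common form of the idea, not the authors' implementation. It assumes the "complex" quantizer is per-channel, the "simplified" quantizer uses a single shared scale, and the per-channel ratios are folded into the weights of the next linear layer. All names (`quantize`, `reparameterize`, `spread`) are hypothetical, and in a real model the activation rescaling would itself be folded into the preceding layer rather than applied explicitly.

```python
import random


def quantize(x, scale, bits=8):
    """Symmetric uniform quantizer: round to an integer code, then clamp."""
    qmax = 2 ** (bits - 1) - 1
    return max(-qmax - 1, min(qmax, round(x / scale)))


def reparameterize(channel_scales, weight):
    """Replace per-channel scales by one shared scale (assumed: their mean).

    weight[c][o] is the weight from input channel c to output o. Each row is
    multiplied by ratio r_c = s_c / shared so that (x_c / r_c) * (r_c * w_co)
    equals x_c * w_co and the layer output is unchanged.
    """
    shared = sum(channel_scales) / len(channel_scales)
    ratios = [s / shared for s in channel_scales]
    new_weight = [[r * w for w in row] for r, row in zip(ratios, weight)]
    return shared, ratios, new_weight


def matvec(vec, mat):
    """Compute vec @ mat for a list vector and a list-of-rows matrix."""
    return [sum(v * row[o] for v, row in zip(vec, mat))
            for o in range(len(mat[0]))]


random.seed(0)
channels, outputs = 4, 3
spread = [0.5, 2.0, 8.0, 1.0]  # illustrative per-channel activation ranges
x = [random.uniform(-m, m) for m in spread]
weight = [[random.uniform(-1, 1) for _ in range(outputs)]
          for _ in range(channels)]
channel_scales = [m / 127 for m in spread]  # "complex" per-channel quantizer

shared, ratios, new_weight = reparameterize(channel_scales, weight)
x_rescaled = [xi / r for xi, r in zip(x, ratios)]

# Per-channel codes on x versus shared-scale codes on the rescaled x.
codes_complex = [quantize(xi, s) for xi, s in zip(x, channel_scales)]
codes_simple = [quantize(xi, shared) for xi in x_rescaled]

# Full-precision outputs before and after folding the ratios into the weights.
y_ref = matvec(x, weight)
y_rep = matvec(x_rescaled, new_weight)
```

Under these assumptions the shared-scale quantizer yields the same integer codes as the per-channel one (up to floating-point rounding), and the layer output is preserved. That is why a single hardware-friendly scale can stand in for the more precise per-channel quantizer.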
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Visual Attention and Saliency Detection
