Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment
Yuhao Ji, Chao Fang, Shaobo Ma, Haikuo Shao, Zhongfeng Wang

TL;DR
This paper co-designs a binarized Transformer model (BMT) and a specialized hardware accelerator (BAT) for edge deployment, reporting up to 49.37x higher throughput and up to 88.53x higher energy efficiency.
Contribution
The paper introduces three things: BMT, a hardware-friendly binarized Transformer with improved accuracy; BAT, a streaming processor accelerator; and a joint optimization approach for edge deployment.
Findings
Up to 49.37x throughput improvement
Up to 88.53x energy efficiency gain
Effective end-to-end edge deployment demonstrated
Abstract
Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices. While binarized Transformers offer a promising solution by significantly reducing model size, existing approaches suffer from algorithm-hardware mismatches with limited co-design exploration, leading to suboptimal performance on edge devices. Hence, we propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization. First, we propose BMT, a novel hardware-friendly binarized Transformer with optimized quantization methods and components, and we further enhance its model accuracy by leveraging the weighted ternary weight splitting training technique. Second, we develop a streaming processor mixed binarized Transformer accelerator, namely BAT, which…
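To illustrate why binarization suits hardware, here is a minimal sketch of a generic binarization scheme. It is not BMT's quantizer, which the abstract excerpt above does not specify. It assumes the common approach of approximating a weight vector as a scale factor times its signs, and shows how a dot product of two ±1 vectors reduces to XNOR plus popcount on packed bits. The function names `binarize`, `pack_bits`, and `xnor_popcount_dot` are hypothetical.

```python
from typing import List, Tuple


def binarize(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|).

    Assumption: a generic scheme for illustration, not the paper's method.
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack_bits(signs: List[int]) -> int:
    """Pack a +/-1 vector into an integer: bit i is set where signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two +/-1 vectors of length n, given their packed bits.

    XNOR marks positions where the signs agree; each agreement contributes +1
    and each disagreement -1, so the result is 2 * agreements - n.
    """
    mask = (1 << n) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agreements - n


if __name__ == "__main__":
    alpha, w_signs = binarize([0.5, -1.5, 2.0, -1.0])
    x_signs = [1, 1, -1, -1]  # hypothetical binarized activations
    dot = xnor_popcount_dot(pack_bits(w_signs), pack_bits(x_signs), len(w_signs))
    print(alpha, dot)
```

Under these assumptions, each weight costs one bit of storage and the multiply-accumulate becomes bitwise logic, which is the kind of property a binarized accelerator can exploit.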
Taxonomy
Topics: Structural Analysis and Optimization · Advanced MEMS and NEMS Technologies
Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Ternary Weight Splitting · Adam · Dropout · Multi-Head Attention
