Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment

Yuhao Ji; Chao Fang; Shaobo Ma; Haikuo Shao; Zhongfeng Wang

arXiv:2407.12070 · cs.LG · May 13, 2025


TL;DR

This paper presents a co-design of a binarized Transformer model (BMT) and a specialized hardware accelerator (BAT) for edge deployment, reporting up to 49.37x throughput improvement and up to 88.53x energy efficiency gain.

Contribution

It introduces three pieces: BMT, a hardware-friendly binarized Transformer whose accuracy is improved with weighted ternary weight splitting training; BAT, a streaming-processor mixed binarized Transformer accelerator; and a joint optimization approach for end-to-end edge deployment.

Findings

01. Up to 49.37x throughput improvement
02. Up to 88.53x energy efficiency gain
03. Effective end-to-end edge deployment demonstrated

Abstract

Transformer models have revolutionized AI tasks, but their large size hinders real-world deployment on resource-constrained and latency-critical edge devices. While binarized Transformers offer a promising solution by significantly reducing model size, existing approaches suffer from algorithm-hardware mismatches with limited co-design exploration, leading to suboptimal performance on edge devices. Hence, we propose a co-design method for efficient end-to-end edge deployment of Transformers from three aspects: algorithm, hardware, and joint optimization. First, we propose BMT, a novel hardware-friendly binarized Transformer with optimized quantization methods and components, and we further enhance its model accuracy by leveraging the weighted ternary weight splitting training technique. Second, we develop a streaming processor mixed binarized Transformer accelerator, namely BAT, which…
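As a rough illustration of why binarization suits hardware, the sketch below shows a generic scheme: each weight is approximated by a shared scale times its sign, and a dot product between two {-1, +1} vectors reduces to XNOR plus popcount. This is not BMT's quantization method, which the abstract does not detail. The function names (`binarize`, `pack`, `binary_dot`) and the mean-absolute-value scale are assumptions made for illustration only.

```python
def binarize(weights):
    """Approximate weights as alpha * signs, with signs in {-1, +1}.

    alpha is the mean absolute value (an assumed, commonly used scale;
    not necessarily what BMT uses).
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack(signs):
    """Pack a {-1, +1} vector into an integer bitmask (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, s in enumerate(signs):
        if s == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    XNOR marks positions where the signs agree; each agreement adds +1
    and each disagreement adds -1, so the result is 2 * agree - n.
    """
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agree - n


if __name__ == "__main__":
    alpha, signs = binarize([0.5, -1.5, 2.0, -1.0])
    print(alpha, signs)
    a = [1, -1, 1, 1]
    b = [1, 1, -1, 1]
    print(binary_dot(pack(a), pack(b), len(a)))
```

The packed form stores one bit per weight instead of a 32-bit float, and the multiply-accumulate becomes bitwise logic. That is the general property binarized accelerators exploit, under the assumptions stated above.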


Taxonomy

Topics: Structural Analysis and Optimization · Advanced MEMS and NEMS Technologies

Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Ternary Weight Splitting · Adam · Dropout · Multi-Head Attention