Balancing FP8 Computation Accuracy and Efficiency on Digital CIM via Shift-Aware On-the-fly Aligned-Mantissa Bitwidth Prediction
Liang Zhao, Kunming Shao, Zhipeng Liao, Xijie Huang, Tim Kwang-Ting Cheng, Chi-Ying Tsui, Yi Zou

TL;DR
This paper introduces a flexible FP8 DCIM accelerator with dynamic bitwidth prediction and input alignment, significantly improving efficiency and adaptability for Transformer inference and training.
Contribution
The work presents a novel shift-aware on-the-fly bitwidth prediction method and a scalable MAC array, enabling adaptive FP8 precision in digital compute-in-memory architectures.
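To make "aligned-mantissa bitwidth" concrete, here is a small sketch. It aligns a group of FP8 values to their shared maximum exponent and picks the narrowest width that keeps every aligned mantissa exact.

Assumptions:
- The E4M3 encoding (1 sign, 4 exponent, 3 mantissa bits, bias 7), with finite inputs only.
- Candidate widths of 2/4/6/8 bits, borrowed from the weight precisions in the abstract.

The exactness rule and all function names (`decode_e4m3`, `required_width`, `select_width`, `align`) are hypothetical illustrations. They are not the paper's DSBP predictor, which predicts the width on the fly rather than computing it exactly.

```python
# Hypothetical sketch: shift-aware aligned-mantissa width selection for FP8 (E4M3).
# Not the paper's DSBP predictor; assumes finite E4M3 inputs (no NaN handling).

MANT_BITS = 3   # E4M3 explicit mantissa bits
EXP_BIAS = 7    # E4M3 exponent bias


def decode_e4m3(code):
    """Split an E4M3 byte into (sign, effective exponent, integer significand)."""
    sign = (code >> 7) & 1
    exp = (code >> 3) & 0xF
    mant = code & 0x7
    if exp == 0:                                # subnormal: no hidden bit, exponent acts as 1
        return sign, 1, mant
    return sign, exp, (1 << MANT_BITS) | mant   # normal: prepend the hidden 1


def to_float(code):
    """Reference value of an E4M3 code: (-1)^s * sig * 2^(e - bias - 3)."""
    s, e, sig = decode_e4m3(code)
    return (-1.0) ** s * sig * 2.0 ** (e - EXP_BIAS - MANT_BITS)


def required_width(sig, shift):
    """Magnitude bits needed so that sig, right-shifted by `shift`, loses no set bits."""
    if sig == 0:
        return 0
    tz = (sig & -sig).bit_length() - 1          # trailing zeros of the significand
    return MANT_BITS + 1 + shift - tz


def select_width(codes, candidates=(2, 4, 6, 8)):
    """Pick the narrowest candidate width that aligns the whole group exactly."""
    fields = [decode_e4m3(c) for c in codes]
    e_max = max(e for _, e, _ in fields)
    need = max(required_width(sig, e_max - e) for _, e, sig in fields)
    for w in sorted(candidates):
        if w >= need:
            return w, e_max
    return max(candidates), e_max               # nothing fits: accept truncation


def align(codes, width):
    """Align to the group's max exponent; returns signed integers and e_max.

    Each value equals aligned * 2**(e_max - EXP_BIAS - width + 1) up to truncation,
    which drops low-order magnitude bits (rounds toward zero).
    """
    fields = [decode_e4m3(c) for c in codes]
    e_max = max(e for _, e, _ in fields)
    out = []
    for s, e, sig in fields:
        mag = (sig << width) >> (MANT_BITS + 1 + (e_max - e))
        out.append(-mag if s else mag)
    return out, e_max
```

Two examples show the behavior:
- The group 1.0, 0.5, 1.5 (codes `0x38`, `0x30`, `0x3C`) needs only 2 bits. It aligns to `[2, 1, 3]` with an LSB weight of 0.5.
- Pairing 1.0 with `0x0F` (15/512) would need 10 bits, so the 8-bit cap truncates the smaller value.

In this sketch, groups with close exponents need few aligned bits. A wide exponent spread is what drives the width up.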
Findings
Achieves 20.4 TFLOPS/W in a 28nm CMOS implementation.
Supports all FP8 formats with 2.8× higher efficiency than prior work.
Demonstrates improved accuracy-efficiency trade-offs when evaluated on Llama-7B.
Abstract
FP8 low-precision formats have gained significant adoption in Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures face challenges in supporting variable FP8 aligned-mantissa bitwidths, as unified alignment strategies and fixed-precision multiply-accumulate (MAC) units struggle to handle input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) a dynamic shift-aware bitwidth prediction (DSBP) scheme with on-the-fly input prediction that adaptively adjusts weight (2/4/6/8b) and input (2–12b) aligned-mantissa precision; (2) a FIFO-based input alignment unit (FIAU) replacing complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array achieving flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64×96 CIM array, the…
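The abstract gives the supported weight precisions (2/4/6/8b) but not how the INT MAC array scales between them. A common way to get this kind of flexibility is bit-slicing, assumed here purely for illustration:
- Each signed weight is split into 2-bit slices.
- Every slice is multiplied by the aligned input mantissas.
- The per-slice partial sums are shifted and added.

`weight_slices` and `scalable_dot` are hypothetical names. The inputs are plain signed integers standing in for aligned mantissas.

```python
# Hypothetical sketch of a precision-scalable integer MAC: 2/4/6/8-bit signed weights
# are processed as 1-4 two-bit slices and the slice partial sums are shift-added.
# This decomposition is an assumption, not the paper's documented array design.


def weight_slices(w, width):
    """Split a signed `width`-bit weight into 2-bit slices, least significant first.

    Lower slices are unsigned in [0, 3]; the top slice keeps the two's-complement
    sign and lies in [-2, 1], so sum(slice_k * 4**k) == w.
    """
    if width not in (2, 4, 6, 8):
        raise ValueError("width must be 2, 4, 6 or 8")
    if not -(1 << (width - 1)) <= w < (1 << (width - 1)):
        raise ValueError("weight out of range for width")
    bits = w & ((1 << width) - 1)           # two's-complement bit pattern
    n = width // 2
    slices = []
    for k in range(n):
        s = (bits >> (2 * k)) & 0b11
        if k == n - 1 and s >= 2:           # top slice carries the sign bit
            s -= 4
        slices.append(s)
    return slices


def scalable_dot(inputs, weights, width):
    """Signed dot product computed one 2-bit weight slice at a time."""
    sliced = [weight_slices(w, width) for w in weights]
    acc = 0
    for k in range(width // 2):
        # One pass over a 2-bit-weight column: multiply each aligned input by the
        # k-th slice of its weight, then shift the partial sum into position.
        partial = sum(x * ws[k] for x, ws in zip(inputs, sliced))
        acc += partial << (2 * k)
    return acc
```

For example, `scalable_dot([3, -2, 5], [-128, 127, 7], 8)` takes four slice passes and matches the plain dot product. At width 2, the same structure needs only one pass. In this sketch, compute cost therefore scales with the selected weight width, which is the property a precision-scalable array is meant to exploit.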
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing · Low-power high-performance VLSI design
