Balancing FP8 Computation Accuracy and Efficiency on Digital CIM via Shift-Aware On-the-fly Aligned-Mantissa Bitwidth Prediction

Liang Zhao; Kunming Shao; Zhipeng Liao; Xijie Huang; Tim Kwang-Ting Cheng; Chi-Ying Tsui; Yi Zou

arXiv:2602.05743·cs.AR·May 19, 2026

TL;DR

This paper introduces a flexible FP8 DCIM accelerator with dynamic bitwidth prediction and input alignment, significantly improving efficiency and adaptability for Transformer inference and training.

Contribution

The work presents a novel shift-aware on-the-fly bitwidth prediction method and a scalable MAC array, enabling adaptive FP8 precision in digital compute-in-memory architectures.

Findings

01. Achieves 20.4 TFLOPS/W in a 28nm CMOS implementation.

02. Supports all FP8 formats with 2.8× higher efficiency than prior work.

03. Demonstrates improved accuracy-efficiency trade-offs on Llama-7b.

Abstract

FP8 low-precision formats have gained significant adoption in Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures face challenges in supporting variable FP8 aligned-mantissa bitwidths, as unified alignment strategies and fixed-precision multiply-accumulate (MAC) units struggle to handle input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) a dynamic shift-aware bitwidth prediction (DSBP) with on-the-fly input prediction that adaptively adjusts weight (2/4/6/8b) and input (2 to 12b) aligned-mantissa precision; (2) a FIFO-based input alignment unit (FIAU) replacing complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array achieving flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64 × 96 CIM array, the…
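
To make the term "aligned-mantissa bitwidth" concrete, the following is a minimal Python sketch of generic group-wise mantissa alignment for FP8 E4M3 values. It is not the paper's DSBP or FIAU: the function names (`decode_e4m3`, `align_group`), the truncating shift, and the convention that `bitwidth` counts magnitude bits without the sign are all assumptions made for illustration. It only shows why values with smaller exponents lose low-order bits when the aligned bitwidth is narrow.

```python
E4M3_BIAS = 7   # exponent bias of the FP8 E4M3 format
SIG_BITS = 4    # hidden bit plus 3 stored mantissa bits


def decode_e4m3(byte):
    """Return (sign, exponent, significand) with
    value = sign * significand * 2**(exponent - 3)."""
    sign = -1 if byte & 0x80 else 1
    exp_field = (byte >> 3) & 0xF
    man_field = byte & 0x7
    if exp_field == 0:
        # Subnormal: no hidden bit, exponent pinned to 1 - bias.
        return sign, 1 - E4M3_BIAS, man_field
    # Normal: restore the hidden leading 1 (the NaN encoding is not handled).
    return sign, exp_field - E4M3_BIAS, (1 << 3) | man_field


def align_group(fp8_bytes, bitwidth):
    """Align every significand to the group's largest exponent.

    Returns (signed integers, scale) such that each value is
    approximately integer * scale. `bitwidth` is the number of
    magnitude bits kept after alignment (illustrative convention).
    """
    decoded = [decode_e4m3(b) for b in fp8_bytes]
    e_max = max(e for _, e, _ in decoded)
    aligned = []
    for sign, e, sig in decoded:
        # Widen to `bitwidth` bits, then shift right by the exponent gap.
        net = (bitwidth - SIG_BITS) - (e_max - e)
        mag = sig << net if net >= 0 else sig >> -net  # right shift truncates
        aligned.append(sign * mag)
    scale = 2.0 ** (e_max + 1 - bitwidth)
    return aligned, scale


group = [0x38, 0x30, 0x29]  # encodes 1.0, 0.5, 0.28125
for bits in (4, 6):
    ints, scale = align_group(group, bits)
    print(bits, ints, [i * scale for i in ints])
```

With 4 magnitude bits the third value is truncated to 0.25, while 6 bits represent all three values exactly. A wider aligned mantissa preserves small-exponent operands at the cost of wider integer MAC operands, which is the accuracy-efficiency trade-off the abstract refers to.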

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing · Low-power high-performance VLSI design