TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines
Jiayi Wang, Maohua Nie, Sin-Chen Lin, C.-J. Richard Shi, Ang Li

TL;DR
TransDot introduces a reconfigurable FPGA floating-point unit that efficiently supports trans-precision dot-product accumulation, enhancing throughput and area efficiency for AI engines.
Contribution
It unifies multi-precision SIMD FMA and trans-precision DPA in a shared reconfigurable datapath, supporting various dot-product formats with improved efficiency.
Findings
TransDot achieves 2x FP16 throughput in DPA.
TransDot achieves 4x FP8 and 8x FP4 throughput in DPA.
TransDot improves area efficiency by up to 2.92x.
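One way to read the 2x, 4x, and 8x factors is as the number of low-precision elements that fit in one operand word. The sketch below assumes a 32-bit operand word, chosen to match the FP32 accumulator mentioned in the abstract. The names `WORD_BITS`, `FORMAT_BITS`, and `elements_per_word` are illustrative and do not come from the paper.

```python
# Assumption: operands are packed into a 32-bit word, the same width as
# the FP32 accumulator. The paper summary does not state this explicitly.
WORD_BITS = 32

# Bit widths of the formats named in the findings (FP32 added for reference).
FORMAT_BITS = {"FP32": 32, "FP16": 16, "FP8": 8, "FP4": 4}


def elements_per_word(element_bits, word_bits=WORD_BITS):
    """Return how many elements of the given width fit in one operand word."""
    if element_bits <= 0 or word_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the word width")
    return word_bits // element_bits


if __name__ == "__main__":
    for name, bits in FORMAT_BITS.items():
        print(f"{name}: {elements_per_word(bits)} element(s) per {WORD_BITS}-bit word")
```

Under this assumption, a DPA that consumes every packed element per operation processes 2 FP16, 4 FP8, or 8 FP4 products at once, which lines up with the reported factors. The 2.92x area-efficiency figure is a separate measurement and is not modelled here.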
Abstract
Commercial FPGAs, such as AMD Versal devices, increasingly incorporate AI engines that exploit low-precision packed-SIMD fused multiply-accumulate (FMA) to achieve proportional throughput gains. However, trans-precision FMA (e.g., multiplying two FP16 numbers and adding their result to an FP32 accumulator), which preserves numerical stability by accumulating in higher precision, remains bottlenecked by the highest-precision, lowest-throughput operation. Dot-product accumulation (DPA) (e.g., performing a dot-product on two 4-element FP8 vectors and adding its result to an FP32 accumulator) can fully utilize the input/output bandwidth and computational resources. Existing flexible open-source FPUs, such as FPnew, do not support DPA and implement SIMD FMA on low-precision formats by replicating independent FMA lanes, which increases area, underutilizes shared arithmetic resources, and…
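The difference between trans-precision FMA and DPA described above can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the TransDot datapath: it uses FP16 inputs with 2-element vectors because Python's `struct` module can round to FP16 and FP32 but has no FP8 or FP4 format, and it rounds once to FP32 at the end of each operation, which may differ from the hardware's internal rounding. The function names are hypothetical.

```python
import struct


def round_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def trans_fma(a, b, acc):
    """Trans-precision FMA model: FP16 * FP16 + FP32 -> FP32.

    One product per operation, so the FP32 accumulator limits throughput.
    """
    return round_fp32(round_fp16(a) * round_fp16(b) + round_fp32(acc))


def dpa(a_vec, b_vec, acc):
    """Dot-product accumulation model: sum(FP16 * FP16) + FP32 -> FP32.

    All products are formed in one operation and rounded to FP32 once
    (an assumption about rounding, not a claim about the hardware).
    """
    if len(a_vec) != len(b_vec):
        raise ValueError("input vectors must have the same length")
    products = [round_fp16(a) * round_fp16(b) for a, b in zip(a_vec, b_vec)]
    return round_fp32(sum(products) + round_fp32(acc))


a = [1.5, -2.0]
b = [0.5, 4.0]

# Two chained trans-precision FMAs versus a single 2-element DPA.
two_fmas = trans_fma(a[1], b[1], trans_fma(a[0], b[0], 1.0))
one_dpa = dpa(a, b, 1.0)
```

For these inputs both paths produce the same FP32 value, but the DPA path gets there in one operation instead of two, which is the throughput argument the abstract makes.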
