Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores
Nikhil Rout, Blaise Tine

TL;DR
Ten-Four is a fused dot product unit for GPGPU Tensor Cores that accelerates mixed-precision matrix operations, reporting 4-cycle latency and 134.308 GFLOPS peak throughput per Tensor Core on FPGA hardware.
Contribution
The paper presents a scalable, open-source fused dot product architecture that integrates the floating-point and integer pipelines within a single unit for GPGPU Tensor Cores.
Findings
Achieves 4-cycle latency at 262.325 MHz Fmax.
Delivers 134.308 GFLOPS peak throughput per Tensor Core.
Provides ~3.1x performance improvement over previous implementations.
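As a rough sanity check on the latency and throughput figures, the sketch below divides the reported peak throughput by the reported Fmax. It assumes that peak throughput is simply operations per cycle times clock frequency, and that one multiply-accumulate counts as 2 FLOPs; both are common conventions but neither is stated in this summary, and the variable names are illustrative.

```python
# Reported figures from the findings list.
FMAX_HZ = 262.325e6      # 262.325 MHz
PEAK_FLOPS = 134.308e9   # 134.308 GFLOPS per Tensor Core

# Assumption: peak throughput = FLOPs per cycle * Fmax.
flops_per_cycle = PEAK_FLOPS / FMAX_HZ

# Assumption: one multiply-accumulate (MAC) is counted as 2 FLOPs.
macs_per_cycle = flops_per_cycle / 2

print(f"FLOPs per cycle: {flops_per_cycle:.2f}")
print(f"MACs per cycle:  {macs_per_cycle:.2f}")
```

Under these assumptions the two figures are consistent with roughly 512 FLOPs, or about 256 multiply-accumulates, per cycle per Tensor Core.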
Abstract
Efficient mixed-precision matrix multiply accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic units, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a scalable mixed-precision fused dot product unit that integrates both the floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native support for Microscaling (MX) and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor…
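The rounding-error argument in the abstract can be illustrated with a small numerical model. This is a minimal sketch, not the Ten-Four datapath: it assumes FP16 inputs with FP32 accumulation, models the "discrete" path as rounding to FP32 after every multiply and every add, and models the "fused" path as exact accumulation with one final rounding. The helper names (`to_fp16`, `to_fp32`, `discrete_dot`, `fused_dot`) are hypothetical, and the actual internal width and rounding behavior of the hardware are not described in this summary.

```python
import struct
from fractions import Fraction

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def discrete_dot(a, b, acc=0.0):
    """Model of separate multiplier and adder units: round after every step."""
    for x, y in zip(a, b):
        prod = to_fp32(x * y)        # FP16 x FP16 product, rounded to FP32
        acc = to_fp32(acc + prod)    # FP32 add, rounded again
    return acc

def fused_dot(a, b, acc=0.0):
    """Model of a fused unit: accumulate exactly, round once at the end."""
    exact = Fraction(acc)
    for x, y in zip(a, b):
        exact += Fraction(x) * Fraction(y)
    # Conversion goes through a double first; that is exact for this example,
    # though in general it can double-round in rare cases.
    return to_fp32(float(exact))

# One large product plus four products of 2**-25 each. Every small product
# is below half an FP32 ulp of 1.0, so the discrete path drops all of them.
a = [to_fp16(v) for v in [1.0] + [2.0 ** -12] * 4]
b = [to_fp16(v) for v in [1.0] + [2.0 ** -13] * 4]

discrete = discrete_dot(a, b)
fused = fused_dot(a, b)
print("discrete:", repr(discrete))
print("fused:   ", repr(fused))
print("fused differs from discrete:", fused != discrete)
```

In this example the discrete model returns exactly 1.0, while the fused model returns 1.0 + 2**-23, because the four small products together amount to one full FP32 ulp that survives the single final rounding.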
