Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

Nikhil Rout; Blaise Tine

arXiv:2512.00053·cs.AR·April 7, 2026

TL;DR

Ten-Four introduces a fused dot product unit for GPGPU Tensor Cores that accelerates mixed-precision matrix operations, reaching 4-cycle latency at 262.325 MHz Fmax and 134.308 GFLOPS peak throughput per Tensor Core on FPGA hardware.

Contribution

The paper presents a scalable, open-source fused dot product architecture that integrates the floating-point and integer pipelines within a single unit for GPGPU Tensor Cores.

Findings

01 Achieves 4-cycle latency at 262.325 MHz Fmax.

02 Delivers 134.308 GFLOPS peak throughput per Tensor Core.

03 Provides ~3.1x performance improvement over previous implementations.
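
As a rough sanity check on the first two findings, the reported clock and peak throughput can be turned into per-cycle figures. The sketch below is only arithmetic on the two numbers quoted above; the variable names are illustrative, and the convention that one multiply-accumulate counts as two floating-point operations is an assumption, not something this page states.

```python
# Figures quoted in the findings above.
FMAX_HZ = 262.325e6        # reported maximum clock frequency
PEAK_FLOPS = 134.308e9     # reported peak throughput per Tensor Core
LATENCY_CYCLES = 4         # reported pipeline latency

# Floating-point operations retired per clock cycle at peak.
flops_per_cycle = PEAK_FLOPS / FMAX_HZ

# ASSUMPTION: 1 multiply-accumulate (MAC) = 2 FLOPs (one multiply, one add).
macs_per_cycle = flops_per_cycle / 2

# Wall-clock latency of one pass through the pipeline at Fmax.
latency_ns = LATENCY_CYCLES / FMAX_HZ * 1e9

print(f"{flops_per_cycle:.1f} FLOPs/cycle, "
      f"{macs_per_cycle:.1f} MACs/cycle, "
      f"{latency_ns:.2f} ns latency")
```

Under that convention the figures work out to roughly 512 FLOPs per cycle per Tensor Core. The page does not say which input format the peak figure refers to.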

Abstract

Efficient mixed-precision matrix multiply accumulate (MMA) operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source dot product implementations for Tensor Cores rely on discrete arithmetic units, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a scalable mixed-precision fused dot product unit that integrates both the floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native support for Microscaling (MX) and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor…
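
The rounding-error argument in the abstract can be illustrated with a small software model. The sketch below is not the Ten-Four datapath: `discrete_dot` and `fused_dot` are hypothetical names, and the model simply contrasts a chain of separate FP32 operations (one rounding per step) with a dot product that is summed at higher precision and rounded to FP32 once. It assumes FP16 inputs and FP32 accumulation, one of the format pairs the abstract lists, and uses only the Python standard library.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def discrete_dot(a, b, c=0.0):
    """Model of discrete units: every multiply and every add rounds to FP32."""
    acc = to_fp32(c)
    for x, y in zip(a, b):
        prod = to_fp32(to_fp16(x) * to_fp16(y))  # FP16 x FP16 fits FP32 exactly
        acc = to_fp32(acc + prod)                # rounding happens on each add
    return acc

def fused_dot(a, b, c=0.0):
    """Model of a fused unit: sum all products, then round to FP32 once.

    math.fsum returns a correctly rounded float64 sum, standing in for a wide
    internal accumulator. Going through float64 can double-round in rare
    cases; a hardware unit would round directly from its internal format.
    """
    products = [to_fp16(x) * to_fp16(y) for x, y in zip(a, b)]
    return to_fp32(math.fsum(products + [to_fp32(c)]))

# One large product followed by eight small ones that each fall below
# half an FP32 unit in the last place at 2**22.
a = [2048.0] + [0.25] * 8
b = [2048.0] + [0.5] * 8

discrete_result = discrete_dot(a, b)
fused_result = fused_dot(a, b)
print(discrete_result, fused_result)
```

In this example each small product (0.125) is absorbed when it is added to 2^22 one at a time, so the discrete model loses all eight of them, while the single final rounding in the fused model keeps their combined contribution of 1.0.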