Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs

Fumiya Kono; Naohito Nakasato; Maho Nakata

arXiv:2306.04087 · cs.DC · June 12, 2023 · 1 citation

Open Access

TL;DR

This paper presents a high-performance FPGA implementation of 128-bit floating-point matrix multiplication, significantly accelerating scientific computations like semidefinite programming and LU decomposition compared to CPU solutions.

Contribution

It introduces a novel FPGA-based binary128 GEMM design that achieves high throughput and enables acceleration of complex numerical applications previously limited by hardware support.

Findings

01. Achieved approximately 90 GFlops for binary128 GEMM on an FPGA.
02. Delivered a 147x speedup over a CPU running 20 threads for large matrices.
03. First FPGA acceleration of LU decomposition and SDP problems using binary128 arithmetic.
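To put the throughput figures above in perspective, the following is a rough back-of-the-envelope sketch, not a result from the paper. It assumes the conventional GEMM operation count of 2·m·n·k floating-point operations, treats the reported 90 GFlops and 147x as exact, and uses a hypothetical matrix size of 4096; the implied CPU rate is an inference from those two numbers only.

```python
def gemm_flops(m, n, k):
    """Conventional operation count for an (m x k) times (k x n) product:
    one multiply and one add per inner-loop step (an assumed convention)."""
    return 2 * m * n * k


def gemm_seconds(n, gflops):
    """Idealized time for a square n x n GEMM at a sustained rate in GFlops."""
    return gemm_flops(n, n, n) / (gflops * 1e9)


FPGA_GFLOPS = 90.0   # "approximately 90 GFlops" from the findings
SPEEDUP = 147.0      # reported speedup over a 20-thread CPU, large matrices

# CPU rate implied by the two reported figures (an inference, not a
# number quoted in the source).
cpu_gflops = FPGA_GFLOPS / SPEEDUP

if __name__ == "__main__":
    n = 4096  # hypothetical matrix size, chosen only for illustration
    print(f"implied CPU rate: {cpu_gflops:.3f} GFlops")
    print(f"FPGA time for n={n}: {gemm_seconds(n, FPGA_GFLOPS):.2f} s")
    print(f"CPU time for n={n}:  {gemm_seconds(n, cpu_gflops):.2f} s")
```

Under these assumptions the CPU sustains well under 1 GFlops in binary128, which is consistent with the abstract's point that software binary128 arithmetic is slow on general-purpose processors.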

Abstract

General Matrix Multiplication (GEMM) is a fundamental operation widely used in scientific computations. Its performance and accuracy significantly impact the performance and accuracy of applications that depend on it. One such application is semidefinite programming (SDP), which often requires binary128 or higher-precision arithmetic to solve problems stably. However, only some processors support binary128 arithmetic, which makes SDP solvers generally slow. In this study, we focused on accelerating GEMM with binary128 arithmetic on field-programmable gate arrays (FPGAs) to enable the flexible design of accelerators for the desired computations. Our binary128 GEMM designs on a recent high-performance FPGA achieved approximately 90 GFlops, 147x faster than the same computation executed on a recent CPU with 20 threads for large matrices. Using our binary128 GEMM design on the…
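As an illustration of why GEMM accuracy depends on the working precision, here is a minimal software sketch of the GEMM operation C ← αAB + βC with every intermediate result rounded to a fixed significand width. It is not the authors' FPGA design. The helper names `round_to_precision` and `gemm` are hypothetical, and the rounding only emulates the significand width (53 bits for binary64, 113 bits for binary128) with round-to-nearest-even; exponent range, subnormals, and special values are ignored.

```python
from fractions import Fraction


def round_to_precision(x, bits):
    """Round x to the nearest value with a `bits`-bit significand,
    ties to even. Exponent range is unbounded (a simplifying assumption)."""
    x = Fraction(x)
    if x == 0:
        return x
    sign = 1 if x > 0 else -1
    mag = abs(x)
    # Find e such that 2**(bits-1) <= mag / 2**e < 2**bits.
    e = mag.numerator.bit_length() - mag.denominator.bit_length() - bits
    while mag / Fraction(2) ** e >= 2 ** bits:
        e += 1
    while mag / Fraction(2) ** e < 2 ** (bits - 1):
        e -= 1
    n = round(mag / Fraction(2) ** e)  # round() on a Fraction is half-to-even
    return sign * n * Fraction(2) ** e


def gemm(alpha, A, B, beta, C, bits):
    """Naive triple-loop GEMM: returns alpha*A*B + beta*C, rounding each
    multiply and each add to `bits` bits of significand."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[Fraction(0)] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = Fraction(0)
            for p in range(k):
                prod = round_to_precision(Fraction(A[i][p]) * Fraction(B[p][j]), bits)
                acc = round_to_precision(acc + prod, bits)
            scaled_ab = round_to_precision(Fraction(alpha) * acc, bits)
            scaled_c = round_to_precision(Fraction(beta) * Fraction(C[i][j]), bits)
            out[i][j] = round_to_precision(scaled_ab + scaled_c, bits)
    return out


# A 1x3 by 3x1 product whose exact value is 1, with heavy cancellation.
A = [[10**16, 1, -(10**16)]]
B = [[1], [1], [1]]
C = [[0]]

if __name__ == "__main__":
    print("53-bit significand: ", gemm(1, A, B, 0, C, bits=53)[0][0])
    print("113-bit significand:", gemm(1, A, B, 0, C, bits=113)[0][0])
```

With a 53-bit significand the small term is absorbed when it is added to 10^16 and the result collapses to 0, while a 113-bit significand holds 10^16 + 1 exactly and recovers the correct value 1. This is a contrived case, but the same kind of cancellation is what destabilizes solvers at lower precision.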

Taxonomy

Topics: Low-power high-performance VLSI design · Quantum Computing Algorithms and Architecture · Interconnection Networks and Systems