Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration

Alfredo Metere

arXiv:2511.18674 · cs.PF · November 25, 2025

TL;DR

Low-Rank GEMM applies low-rank approximation with FP8 precision to large matrix multiplication, reporting up to 378 TFLOPS and 75% memory savings on an NVIDIA RTX 4090.

Contribution

The paper presents a low-rank approximation approach to matrix multiplication that adapts to the available hardware, selecting a decomposition method (SVD or randomized SVD) and a precision level to deliver faster, more memory-efficient computation with FP8 acceleration.

Findings

01. Achieves up to 378 TFLOPS on an NVIDIA RTX 4090 for large matrices.

02. Provides 75% memory savings compared to traditional methods.

03. Surpasses cuBLAS performance for matrices with N ≥ 10240 through memory bandwidth optimization.
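
This listing does not say how the 75% memory figure is derived. Purely as an illustration of how storing factors compares with storing a dense matrix, the sketch below counts elements for an n × n matrix kept as two rank-r factors (n × r and r × n). The helper names and the rank r = n/8 are assumptions made for this example, not values from the paper; that rank is chosen only because it makes the arithmetic come out at 75% at equal element width.

```python
def dense_elements(n):
    """Elements needed to store a dense n x n matrix."""
    return n * n


def factored_elements(n, rank):
    """Elements needed to store two factors of shapes (n x rank) and (rank x n)."""
    return 2 * n * rank


def memory_savings(n, rank):
    """Fraction of storage saved by keeping the factors instead of the dense matrix."""
    return 1.0 - factored_elements(n, rank) / dense_elements(n)


n = 20480            # largest matrix size mentioned in the abstract
rank = n // 8        # ASSUMED rank, picked for illustration only
savings = memory_savings(n, rank)
print(f"n={n}, rank={rank}, savings={savings:.2%}")
```

The same relation shows the trade-off: savings shrink linearly as the rank grows and vanish entirely at r = n/2, so the benefit depends on the operands being well approximated at a small rank.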

Abstract

Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $O(n^{3})$ for a matrix of size $n \times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N = 20480$, providing 75% memory savings and $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on NVIDIA RTX 4090 demonstrates that…
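
The general idea of multiplying through low-rank factors can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: it runs in float64 on the CPU, uses a basic randomized SVD, and does not model FP8 arithmetic or kernel selection. The function names `randomized_low_rank` and `low_rank_matmul`, the `oversample` parameter, and the seeds are all hypothetical choices for this example.

```python
import numpy as np


def randomized_low_rank(a, rank, oversample=8, seed=0):
    """Return (left, right) with a ~= left @ right, using a randomized range finder."""
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, min(a.shape))
    # Sample the column space of `a` with a Gaussian test matrix.
    q, _ = np.linalg.qr(a @ rng.standard_normal((a.shape[1], k)))
    # Project onto that basis and take an exact SVD of the small matrix.
    u_small, s, vt = np.linalg.svd(q.T @ a, full_matrices=False)
    left = (q @ u_small[:, :rank]) * s[:rank]  # n x r, singular values folded in
    right = vt[:rank]                          # r x m
    return left, right


def low_rank_matmul(a, b, rank):
    """Approximate a @ b, returning the result in factored form (left, right)."""
    la, ra = randomized_low_rank(a, rank, seed=0)
    lb, rb = randomized_low_rank(b, rank, seed=1)
    core = ra @ lb           # r x r product, the only step that mixes both operands
    return la, core @ rb     # (n x r, r x m); left @ right ~= a @ b


# Usage on synthetic operands that are exactly rank 4.
rng = np.random.default_rng(42)
a = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
b = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
left, right = low_rank_matmul(a, b, rank=4)
rel_err = np.linalg.norm(left @ right - a @ b) / np.linalg.norm(a @ b)
print(f"relative error: {rel_err:.2e}")
```

Once the factors exist, the two small products cost on the order of n·r² operations, and the result stays in factored form. Materializing the dense n × n output, or computing the decompositions themselves, adds further cost that this sketch does not account for. For operands that are not close to low rank, the approximation error grows accordingly.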

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Parallel Computing and Optimization Techniques