Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Alfredo Metere

TL;DR
Low-Rank GEMM accelerates large matrix multiplication by combining low-rank matrix approximation with FP8 precision, reaching up to 378 TFLOPS and 75% memory savings on an NVIDIA RTX 4090.
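The FP8 side of the method can be illustrated with a small NumPy simulation. This is a sketch under stated assumptions, not the paper's kernel. It assumes the E4M3 format (3 mantissa bits, largest finite value 448) with a single per-tensor scale. Because NumPy has no FP8 dtype, it emulates rounding in float64, and it accumulates the product in float32. The names `round_to_e4m3`, `quantize_fp8`, and `fp8_matmul` are hypothetical.

```python
import numpy as np

# Assumed format: FP8 E4M3 ("fn" variant). These constants describe that format only.
E4M3_MAX = 448.0          # largest finite E4M3 value
E4M3_MIN_EXP = -6         # exponent of the smallest normal number (2**-6)
E4M3_MANTISSA_BITS = 3

def round_to_e4m3(x):
    """Simulate round-to-nearest-even into E4M3 (values stay in float64)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    mag = np.abs(x)
    # Per-element exponent; subnormals share the minimum normal exponent.
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    exp = np.maximum(exp, E4M3_MIN_EXP)
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)   # spacing of representable values
    return np.round(x / step) * step           # np.round rounds half to even

def quantize_fp8(M):
    """Hypothetical per-tensor scaling: map max |M| to E4M3_MAX, then round."""
    amax = np.max(np.abs(M))
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return round_to_e4m3(M * scale), scale

def fp8_matmul(A, B):
    """Multiply simulated-FP8 operands, accumulate in float32, then undo the scales."""
    qa, sa = quantize_fp8(A)
    qb, sb = quantize_fp8(B)
    # FP8 tensor-core GEMMs typically accumulate in FP32; float32 matmul mimics that here.
    return (qa.astype(np.float32) @ qb.astype(np.float32)) / (sa * sb)
```

With 3 mantissa bits, each rounded element carries at most about 6% relative error. The per-tensor scale keeps most values in the normal range, so the error of the full product stays modest.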
Contribution
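The paper presents a low-rank approximation approach to matrix multiplication that adapts to the hardware. It selects the decomposition method (SVD or randomized SVD) and the precision level from the matrix characteristics and the available accelerators, which makes computation faster and more memory-efficient with FP8 acceleration.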
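The following is a minimal sketch of the low-rank idea. It assumes both operands are factored with a truncated SVD at a fixed, user-chosen `rank`. The paper selects methods and precision adaptively, and that logic is not reproduced here. The product is kept in factored form, so once the factors exist, the multiply itself costs $\mathcal{O}(N r^2)$ rather than $\mathcal{O}(N^3)$. The full SVD used here is itself cubic. It stands in for whatever cheaper or cached factorization a real implementation would use. `truncated_svd` and `low_rank_gemm` are hypothetical names.

```python
import numpy as np

def truncated_svd(M, rank):
    """Return (L, R) with M ≈ L @ R, where L is m x rank and R is rank x n."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    L = U[:, :rank] * s[:rank]      # fold singular values into the left factor
    R = Vt[:rank, :]
    return L, R

def low_rank_gemm(A, B, rank):
    """Approximate A @ B from rank-`rank` factors of A and B.

    Returns (left, right) with A @ B ≈ left @ right. Only rank x rank and
    rank x n intermediates are formed, so the multiply costs O(N * rank^2).
    """
    LA, RA = truncated_svd(A, rank)
    LB, RB = truncated_svd(B, rank)
    core = RA @ LB                  # (rank x k) @ (k x rank) -> rank x rank
    return LA, core @ RB            # left: m x rank, right: rank x n
```

Densifying the result with `left @ right` costs $\mathcal{O}(N^2 r)$. The savings are therefore largest when downstream code can consume the factors directly.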
Findings
Achieves up to 378 TFLOPS on an NVIDIA RTX 4090 for large matrices.
Provides 75% memory savings compared to traditional methods.
Surpasses cuBLAS performance for matrices N≥10240 through memory bandwidth optimization.
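Back-of-the-envelope arithmetic shows the scale of the potential savings. Two assumptions here are ours, not the paper's. First, the rank $r = 256$ is an arbitrary illustrative value. Second, the 75% figure is reproduced by comparing 1-byte FP8 storage with 4-byte FP32 storage of the same elements. That is one simple way such a number can arise, and the paper's own accounting may differ. All function names are hypothetical.

```python
def dense_gemm_flops(n):
    """Flops for a dense N x N by N x N product: N^3 multiply-adds = 2*N^3 flops."""
    return 2 * n**3

def factored_gemm_flops(n, r):
    """Flops to multiply rank-r factorizations A ≈ LA @ RA, B ≈ LB @ RB while keeping
    the result factored: RA @ LB (2*N*r^2) plus core @ RB (2*N*r^2).
    Excludes the cost of computing the factors themselves."""
    return 4 * n * r**2

def fp8_memory_savings(n_elements):
    """Fraction of memory saved storing n_elements in FP8 (1 byte) vs FP32 (4 bytes)."""
    fp32_bytes = 4 * n_elements
    fp8_bytes = 1 * n_elements
    return 1 - fp8_bytes / fp32_bytes

n, r = 10240, 256   # N matches the cuBLAS crossover in the findings; r is hypothetical
print(f"flop ratio dense/factored: {dense_gemm_flops(n) / factored_gemm_flops(n, r):.0f}x")
print(f"FP8 vs FP32 memory savings: {fp8_memory_savings(n * n):.0%}")
```

The flop ratio $2N^3 / (4 N r^2) = N^2 / (2 r^2)$ grows quadratically with $N$ at fixed rank. This is consistent with low-rank methods paying off mainly for large matrices.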
Abstract
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n \times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on large matrices, providing 75% memory savings and a speedup over PyTorch FP32. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on NVIDIA RTX 4090 demonstrates that…
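The abstract says the system chooses between exact SVD and randomized SVD. The sketch below implements the standard randomized range finder of Halko, Martinsson, and Tropp with power iterations, plus a selection rule. The abstract does not state the selection criteria, so the rule in `choose_decomposition` and its `size_threshold` value are hypothetical stand-ins. All function names are ours.

```python
import numpy as np

def randomized_svd(M, rank, oversample=10, n_iter=2, seed=0):
    """Randomized SVD (Halko-Martinsson-Tropp): approximate top-`rank` factors of M."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    k = min(rank + oversample, m, n)
    Omega = rng.standard_normal((n, k))     # random test matrix
    Q, _ = np.linalg.qr(M @ Omega)          # orthonormal basis for the sampled range
    for _ in range(n_iter):                 # power iterations sharpen the spectrum
        Q, _ = np.linalg.qr(M.T @ Q)
        Q, _ = np.linalg.qr(M @ Q)
    B = Q.T @ M                             # small k x n projection of M
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                              # lift left singular vectors back to m dims
    return U[:, :rank], s[:rank], Vt[:rank, :]

def choose_decomposition(shape, rank, size_threshold=2048):
    """Hypothetical rule: exact SVD for small or near-full-rank cases, else randomized."""
    m, n = shape
    if min(m, n) <= size_threshold or rank > min(m, n) // 4:
        return "svd"
    return "randomized_svd"

def factorize(M, rank):
    """Return (method, L, R) with M ≈ L @ R using the method picked above."""
    method = choose_decomposition(M.shape, rank)
    if method == "svd":
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        U, s, Vt = U[:, :rank], s[:rank], Vt[:rank, :]
    else:
        U, s, Vt = randomized_svd(M, rank)
    return method, U * s, Vt
```

Randomized SVD touches $M$ only through a few products with thin matrices. This makes it the natural choice when the matrix is large and the target rank is small.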
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Parallel Computing and Optimization Techniques
