The BLAS API of BLASFEO: optimizing performance for small matrices
Gianluca Frison, Tommaso Sartor, Andrea Zanelli, Moritz Diehl

TL;DR
This paper adds a standard BLAS API to BLASFEO. The implementation switches between algorithms tuned for different matrix sizes and builds on BLASFEO's packed matrix format, and it outperforms conventional BLAS and LAPACK libraries when the matrices fit in cache.
Contribution
It proposes a standard BLAS API for BLASFEO with multiple optimized algorithms, improving performance for small matrices in embedded and scientific computing.
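The idea of choosing among several algorithms by matrix size can be sketched in plain Python. This is an illustration only, not BLASFEO code: `SMALL_DIM`, `gemm_direct`, `gemm_packed`, and `select_algorithm` are hypothetical names, the crossover value is an arbitrary assumption, and the "packing" here is a simple copy of A into a row-contiguous buffer rather than the library's actual format.

```python
SMALL_DIM = 8  # hypothetical crossover size; a real library would tune this per target


def gemm_direct(m, n, k, a, b):
    """C = A*B computed directly on column-major inputs, with no packing step."""
    c = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            c[i + j * m] = sum(a[i + l * m] * b[l + j * k] for l in range(k))
    return c


def gemm_packed(m, n, k, a, b):
    """C = A*B after copying A into a buffer whose rows are contiguous."""
    # Packing costs O(m*k) extra work, which only pays off for larger matrices.
    a_rows = [a[i + l * m] for i in range(m) for l in range(k)]
    c = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            c[i + j * m] = sum(a_rows[i * k + l] * b[l + j * k] for l in range(k))
    return c


def select_algorithm(m, n, k):
    """Pick the variant by problem size (assumed rule: largest dimension)."""
    return gemm_direct if max(m, n, k) <= SMALL_DIM else gemm_packed


def gemm(m, n, k, a, b):
    """Single entry point; the caller never sees which variant ran."""
    return select_algorithm(m, n, k)(m, n, k, a, b)
```

Both variants return the same column-major result, so the caller sees one interface while the dispatch rule decides whether the packing overhead is worth paying.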
Findings
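For matrices that fit in cache, BLASFEO outperforms optimized BLAS implementations, both open-source and proprietary.
The new BLAS API brings this performance advantage for cache-resident matrices to code written against the standard BLAS interface.
The approach benefits scientific computing environments such as Julia and SciPy.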
Abstract
BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an API which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user and it removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open-source and proprietary. This paper investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are…
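As a rough illustration of what a panel-major packed format looks like, the sketch below converts a column-major matrix into horizontal panels of `PS` rows, with the elements inside each panel stored column by column. It is a plain-Python model under stated assumptions, not the library's implementation: the panel height of 4, the function names `panel_index` and `pack_panel_major`, and the choice of the column count as panel stride are all picked for illustration.

```python
PS = 4  # panel height (assumed here; the real value depends on the target architecture)


def panel_index(i, j, n, ps=PS):
    """Offset of element (i, j): start of its panel + column offset + row in panel."""
    return (i // ps) * ps * n + j * ps + (i % ps)


def pack_panel_major(a_colmajor, m, n, ps=PS):
    """Pack an m-by-n column-major matrix into panel-major storage.

    Rows are padded with zeros up to a multiple of ps, so every panel is full.
    """
    num_panels = (m + ps - 1) // ps
    packed = [0.0] * (num_panels * ps * n)
    for j in range(n):
        for i in range(m):
            packed[panel_index(i, j, n, ps)] = a_colmajor[i + j * m]
    return packed
```

With this layout, the `ps` entries of one column within a panel are contiguous and consecutive columns of the same panel follow immediately, which is the access pattern a compute kernel working on a `ps`-row block wants. Doing this conversion once, outside the routine call, is what removes the packing cost from each call.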
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Distributed and Parallel Computing Systems
