Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

L.A. Torres; Carlos J. Barrios H; Yves Denneulin

arXiv:2405.17322·cs.DC·May 28, 2024·1 cites

TL;DR

This study compares the computational speed, energy efficiency, and accuracy of matrix multiplication algorithms on CPUs and GPUs using MKL, cuBLAS, and SYCL, highlighting trade-offs across different hardware and implementations.

Contribution

It provides a comprehensive performance and accuracy comparison of matrix multiplication libraries and implementations on various CPU and GPU architectures.

Findings

01. MKL offers the best performance with slight accuracy loss.

02. OpenMP and SYCL on CPU achieve high accuracy but lower performance.

03. cuBLAS with tensor cores has top performance but reduced accuracy.

Abstract

Matrix multiplication is fundamental to the backpropagation algorithm used to train deep neural network models. Libraries such as Intel's MKL and NVIDIA's cuBLAS implement new, optimized matrix multiplication techniques that increase performance and reduce computational cost. These techniques can also be implemented in CUDA and SYCL, and in functions that use AVX2 and AVX512 instructions, which have lower performance but better precision. The study compares execution times and power consumption, measured with PAPI and PERF, and compares accuracy across different matrix sizes. Comparisons were made on architectures such as third- and fourth-generation Intel CPUs and NVIDIA V100 and A100 GPUs. The MKL library showed the best performance with a slight loss of precision, while the OpenMP and SYCL implementations on the CPU showed the best accuracy but a loss of performance. On the other hand, the results on GPU…
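The time-versus-accuracy comparison described in the abstract can be illustrated with a small sketch. This is not the authors' benchmark: it is a pure-Python toy, assuming square random matrices, that emulates single-precision accumulation with the standard `struct` module and compares it against a double-precision reference. The helper names (`to_f32`, `matmul_ref`, `matmul_f32`, `max_rel_error`, `compare`) are hypothetical, the timings say nothing about MKL, cuBLAS, or SYCL, and energy measurement with PAPI or perf is out of scope here.

```python
import math
import random
import struct
import time


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matmul_ref(a, b):
    """Reference product: each dot product summed accurately with math.fsum."""
    cols = list(zip(*b))  # columns of b
    return [[math.fsum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def matmul_f32(a, b):
    """Same product, rounding every multiply and add to binary32 (emulated)."""
    cols = list(zip(*b))
    out = []
    for row in a:
        out_row = []
        for col in cols:
            acc = 0.0
            for x, y in zip(row, col):
                acc = to_f32(acc + to_f32(x * y))
            out_row.append(acc)
        out.append(out_row)
    return out


def max_rel_error(c, ref):
    """Largest absolute difference, scaled by the largest reference entry."""
    scale = max(abs(v) for row in ref for v in row)
    worst = max(abs(x - y) for rc, rr in zip(c, ref) for x, y in zip(rc, rr))
    return worst / scale if scale else worst


def compare(n, seed=0):
    """Time both variants on n x n inputs and report the accuracy gap."""
    rng = random.Random(seed)
    # Inputs are rounded to binary32 first so both variants see the same data.
    a = [[to_f32(rng.uniform(-1.0, 1.0)) for _ in range(n)] for _ in range(n)]
    b = [[to_f32(rng.uniform(-1.0, 1.0)) for _ in range(n)] for _ in range(n)]
    t0 = time.perf_counter()
    ref = matmul_ref(a, b)
    t1 = time.perf_counter()
    low = matmul_f32(a, b)
    t2 = time.perf_counter()
    return {
        "n": n,
        "ref_seconds": t1 - t0,
        "f32_seconds": t2 - t1,
        "max_rel_error": max_rel_error(low, ref),
    }


if __name__ == "__main__":
    for size in (16, 32, 64):
        print(compare(size))
```

Running the script prints one record per matrix size. The error column shows how rounding in the lower-precision accumulation grows with the inner dimension, which is the kind of accuracy gap the study reports between implementations.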

Code & Models

Repositories

alejandrotorresn/MatMul
none · Official

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Distributed and Parallel Computing Systems