Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Mufakir Qamar Ansari (1), Mudabir Qamar Ansari (2) ((1) The University of Toledo, Toledo, OH, USA; (2) Lamar University, Beaumont, TX, USA)

arXiv:2507.19723 · cs.DC · July 30, 2025

Open Access

TL;DR

This paper empirically compares matrix multiplication performance on a modern consumer-grade platform, demonstrating that the GPU significantly outperforms the multi-core CPU, especially for large matrices, with speedups of up to 593x over a sequential baseline.

Contribution

It provides a direct performance comparison of matrix multiplication on CPU and GPU architectures using optimized implementations on a consumer-grade device.

Findings

01. The GPU achieves up to a 593x speedup over the sequential implementation.

02. The parallel CPU version provides a 12-14x speedup over the sequential version.

03. GPU performance scales dramatically with matrix size.

Abstract

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily multi-core CPUs and many-core GPUs, is the established solution, and these systems are now ubiquitous from datacenters to consumer laptops. This paper presents a direct, empirical performance analysis of matrix multiplication on a modern, consumer-grade heterogeneous platform. We implemented and benchmarked three versions of the algorithm: a baseline sequential C++ implementation, a parallel version for its multi-core CPU using OpenMP, and a massively parallel version for its discrete GPU using CUDA with shared memory optimizations. The implementations were evaluated with square matrices of varying dimensions, from 128x128 to 4096x4096. Our results…
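The first two of the three implementations named in the abstract can be illustrated with a minimal C++ sketch. This is not the authors' code: the flat row-major storage, the `Matrix` alias, the function names `matmul_sequential` and `matmul_openmp`, and the i-k-j loop order are all assumptions made for illustration. The CUDA shared-memory version is not sketched here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: an n x n matrix stored row-major in a flat vector.
using Matrix = std::vector<float>;

// Hypothetical sequential baseline: C = A * B with a triple loop.
// The i-k-j order keeps the innermost accesses to b and c contiguous.
Matrix matmul_sequential(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}

// Hypothetical OpenMP variant: rows of C are independent, so the outer
// loop is split across threads. Without OpenMP enabled at compile time
// (e.g. -fopenmp), the pragma is ignored and the loop runs sequentially.
Matrix matmul_openmp(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    const long long rows = static_cast<long long>(n);
    #pragma omp parallel for
    for (long long row = 0; row < rows; ++row) {
        const std::size_t i = static_cast<std::size_t>(row);
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

Each thread writes a disjoint set of rows of the result, so the parallel loop needs no synchronization. A speedup figure such as those reported above would be obtained by timing both functions on the same inputs and dividing the sequential time by the parallel time.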

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Low-power high-performance VLSI design