Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU
Mufakir Qamar Ansari (1), Mudabir Qamar Ansari (2) ((1) The University of Toledo, Toledo, OH, USA, (2) Lamar University, Beaumont, TX, USA)

TL;DR
This paper empirically compares matrix multiplication performance on a modern consumer-grade platform, showing that the GPU significantly outperforms the multi-core CPU, especially for large matrices, with speedups of up to 593x over a sequential baseline.
Contribution
It provides a direct performance comparison of matrix multiplication on CPU and GPU architectures using optimized implementations on a consumer-grade device.
Findings
The GPU achieves up to a 593x speedup over the sequential implementation.
The parallel CPU version provides a 12-14x speedup over the sequential version.
The GPU's advantage grows with matrix size and is largest for large matrices.
Abstract
Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily multi-core CPUs and many-core GPUs, is the established solution, and these systems are now ubiquitous from datacenters to consumer laptops. This paper presents a direct, empirical performance analysis of matrix multiplication on a modern, consumer-grade heterogeneous platform. We implemented and benchmarked three versions of the algorithm: a baseline sequential C++ implementation, a parallel version for its multi-core CPU using OpenMP, and a massively parallel version for its discrete GPU using CUDA with shared memory optimizations. The implementations were evaluated with square matrices of varying dimensions, from 128x128 to 4096x4096. Our results…
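To make the first two variants named in the abstract concrete, here is a minimal C++ sketch of a sequential triple-loop multiply and an OpenMP-parallel counterpart. It is not the authors' code: the `Matrix` alias, the function names `matmul_sequential` and `matmul_openmp`, the row-major flat layout, the `i-k-j` loop order, and the static schedule are all assumptions chosen for illustration. The CUDA shared-memory version is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption of this sketch: an n x n matrix stored row-major in a flat vector.
using Matrix = std::vector<float>;

// Baseline: sequential triple loop computing C = A * B.
// The i-k-j loop order keeps the innermost accesses to B and C contiguous.
Matrix matmul_sequential(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}

// Multi-core variant: rows of C are divided among threads with OpenMP.
// Each thread writes only its own rows, so no synchronization is needed.
// Without OpenMP enabled (e.g. no -fopenmp), the pragma is ignored and the
// loop simply runs sequentially.
Matrix matmul_openmp(const Matrix& a, const Matrix& b, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    const long long rows = static_cast<long long>(n);
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < rows; ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[row * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[row * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}
```

Both functions accumulate each output element in the same order, so their results should match exactly for the same inputs. Speedups like those reported in the findings would be measured by timing each function over the same matrices, for example with `std::chrono::steady_clock`.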
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Low-power high-performance VLSI design
