Parallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming
Gerald Schubert, Georg Hager, Holger Fehske, Gerhard Wellein

TL;DR
This paper investigates optimized parallel sparse matrix-vector multiplication on multicore clusters, demonstrating that explicit communication-computation overlap via a dedicated thread improves performance over traditional MPI and hybrid strategies.
Contribution
It introduces a hybrid MPI+OpenMP approach with a dedicated communication thread to better overlap communication and computation in sparse matrix-vector multiplication.
Findings
Explicit communication overlap improves performance
Dedicated communication thread hides communication cost that nonblocking MPI alone does not hide with standard MPI implementations
Hybrid approach surpasses pure MPI in scalability
Abstract
We evaluate optimized parallel sparse matrix-vector operations for two representative application areas on widespread multicore-based cluster configurations. First, the single-socket baseline performance is analyzed and modeled with respect to basic architectural properties of standard multicore chips. Going beyond the single node, parallel sparse matrix-vector operations often suffer from an unfavorable communication-to-computation ratio. Starting from the observation that nonblocking MPI is not able to hide communication cost using standard MPI implementations, we demonstrate that explicit overlap of communication and computation can be achieved by using a dedicated communication thread, which may run on a virtual core. We compare our approach to pure MPI and the widely used "vector-like" hybrid programming strategy.
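Illustration
The overlap idea in the abstract can be sketched in a few lines. The following is a minimal, single-process sketch in Python, not the authors' MPI+OpenMP implementation. It assumes the matrix rows owned by one process are split into a "local" part (columns whose vector entries the process already owns) and a "nonlocal" part (columns whose entries must be fetched from other processes), both in compressed row storage (CRS). The function names, the `fetch_halo` stand-in for the MPI exchange, and the example data are all hypothetical. Python threads do not run compute-bound code truly in parallel, so the sketch shows only the structure of the scheme, not its performance.

```python
import threading


def spmv_crs(val, col, row_ptr, x, y):
    """Accumulate y += A @ x for a matrix A stored in CRS format."""
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += val[k] * x[col[k]]
        y[i] += acc


def fetch_halo(remote_x, halo_idx, halo_buf):
    """Hypothetical stand-in for the MPI exchange: copy the needed
    remote vector entries into the local halo buffer."""
    for j, g in enumerate(halo_idx):
        halo_buf[j] = remote_x[g]


def overlapped_spmv(local, nonlocal_part, x_local, remote_x, halo_idx):
    """Compute y = A_local @ x_local + A_nonlocal @ halo, fetching the
    halo on a dedicated thread while the local part is computed."""
    n_rows = len(local[2]) - 1
    y = [0.0] * n_rows
    halo_buf = [0.0] * len(halo_idx)

    # Dedicated communication thread: does nothing but the exchange.
    comm = threading.Thread(target=fetch_halo,
                            args=(remote_x, halo_idx, halo_buf))
    comm.start()

    # Meanwhile, the local part needs no remote data.
    spmv_crs(*local, x_local, y)

    # The nonlocal part may start only after the exchange has finished.
    comm.join()
    spmv_crs(*nonlocal_part, halo_buf, y)
    return y


# Example: two rows [2 0 | 1 0] and [0 3 | 0 4]; the first two columns
# are local, the last two refer to entries owned by another process.
local = ([2.0, 3.0], [0, 1], [0, 1, 2])          # (val, col, row_ptr)
nonlocal_part = ([1.0, 4.0], [0, 1], [0, 1, 2])  # col indexes the halo buffer
result = overlapped_spmv(local, nonlocal_part,
                         x_local=[1.0, 2.0],
                         remote_x=[10.0, 20.0],
                         halo_idx=[0, 1])
print(result)  # → [12.0, 86.0]
```

Splitting the kernel this way is what makes overlap possible at all: only the nonlocal part depends on the exchanged data, so the local part can proceed while the communication thread is busy.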
