Scaling All-to-all Operations Across Emerging Many-Core Supercomputers
Shannon Kinkead, Jackson Wesley, Whit Schonbein, David DeBonis, Matthew G. F. Dosanjh, and Amanda Bienz

TL;DR
This paper introduces new all-to-all communication algorithms for emerging many-core supercomputers, reporting up to 3x speedup over system MPI at 32 nodes of Sapphire Rapids systems.
Contribution
The paper presents novel all-to-all algorithms tailored for many-core architectures and a performance analysis against existing algorithms and system MPI.
Findings
The novel algorithms achieve up to 3x speedup over system MPI
The peak speedup is reported at 32 nodes of Sapphire Rapids systems
All-to-all performance depends on message size, process count, architecture, and parallel system partition
Abstract
Performant all-to-all collective operations in MPI are critical to fast Fourier transforms, transposition, and machine learning applications. There are many existing implementations for all-to-all exchanges on emerging systems, with the achieved performance dependent on many factors, including message size, process count, architecture, and parallel system partition. This paper presents novel all-to-all algorithms for emerging many-core systems. Further, the paper presents a performance analysis against existing algorithms and system MPI, with novel algorithms achieving up to 3x speedup over system MPI at 32 nodes of state-of-the-art Sapphire Rapids systems.
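For readers unfamiliar with the operation, the following is a minimal sketch of what an all-to-all exchange computes, simulated in plain Python without MPI. It uses the well-known pairwise-exchange schedule, in which each rank sends one block per step to a shifted partner; it is not one of the paper's novel algorithms. The function name `pairwise_alltoall` and the list-of-lists layout are assumptions made for illustration only.

```python
def pairwise_alltoall(send):
    """Simulate an all-to-all exchange among p ranks (illustrative only).

    send[i][j] is the block that rank i wants to deliver to rank j.
    Returns recv, where recv[j][i] is the block rank j received from rank i.
    """
    p = len(send)
    recv = [[None] * p for _ in range(p)]
    # Pairwise-exchange schedule: in step k, rank r sends to (r + k) % p
    # and therefore receives from (r - k) % p. Step 0 is the local copy.
    for step in range(p):
        for rank in range(p):
            dest = (rank + step) % p
            recv[dest][rank] = send[rank][dest]
    return recv


if __name__ == "__main__":
    p = 4
    # Label each block with its source and destination rank.
    send = [[f"{src}->{dst}" for dst in range(p)] for src in range(p)]
    for rank, row in enumerate(pairwise_alltoall(send)):
        print(rank, row)
```

The result is a transpose of the block matrix: rank j ends up holding the j-th block from every rank, which is why all-to-all underlies distributed transposition and FFTs. In a real MPI program the inner loop would run concurrently across processes, and the cost of each step depends on factors such as message size and process count.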
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Embedded Systems Design Techniques
