Large Scale Parallelization Using File-Based Communications
Chansup Byun, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Siddharth Samsi, Charles Yee, Albert Reuther

TL;DR
This paper introduces a file-based communication architecture for large-scale parallel computing that reduces filesystem overload and improves performance, achieving significant speedups in MPI message broadcasting.
Contribution
It proposes a novel file-based communication method utilizing local filesystems to enhance scalability and performance in large parallel jobs, addressing filesystem overload issues.
Findings
Achieved about 34x better MPI_Bcast() performance in a 2048-process job when using the local filesystem.
Reduced filesystem overload and resource contention in large-scale parallel jobs.
Used the secure copy protocol to transfer message files, with no extra security measures required.
Abstract
In this paper, we present a novel file-based communication architecture that uses the local filesystem for large scale parallelization. This approach eliminates the issues with filesystem overload and resource contention that arise when using the central filesystem for large parallel jobs. It incurs additional overhead from inter-node message file transfers when the sending and receiving processes are on different nodes. However, even with this overhead, the approach brings far greater benefits to overall cluster operation, in addition to improving message communication performance for large scale parallel jobs. For example, when running a 2048-process parallel job, it achieved about 34 times better performance with MPI_Bcast() when using the local filesystem. Furthermore, since the security for transferring message files is handled entirely by…
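To make the idea of file-based messaging concrete, the following is a minimal Python sketch of a send/receive pair that exchanges one message through a directory on a local filesystem. It is not the authors' implementation: the function names (`msg_path`, `send`, `recv`), the file naming scheme, the use of `pickle`, and the polling interval are all assumptions chosen for illustration. It covers only the case where both processes are on the same node; the inter-node transfer of message files described in the abstract is omitted.

```python
import os
import pickle
import tempfile
import time


def msg_path(comm_dir, src, dest, tag):
    # Hypothetical naming scheme: one file per (sender, receiver, tag).
    return os.path.join(comm_dir, f"p{src}_p{dest}_t{tag}.msg")


def send(comm_dir, src, dest, tag, obj):
    """Write obj as a message file in comm_dir (assumed node-local)."""
    final = msg_path(comm_dir, src, dest, tag)
    # Write to a temporary file first, then rename, so the receiver
    # never observes a partially written message.
    fd, tmp = tempfile.mkstemp(dir=comm_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f)
    os.replace(tmp, final)  # atomic within a single filesystem


def recv(comm_dir, src, dest, tag, timeout=5.0, poll=0.01):
    """Poll for the message file, load it, and delete it."""
    path = msg_path(comm_dir, src, dest, tag)
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no message at {path}")
        time.sleep(poll)
    with open(path, "rb") as f:
        obj = pickle.load(f)
    os.remove(path)  # consume the message
    return obj


if __name__ == "__main__":
    comm_dir = tempfile.mkdtemp()  # stands in for a node-local directory
    send(comm_dir, src=0, dest=1, tag=7, obj={"x": [1, 2, 3]})
    print(recv(comm_dir, src=0, dest=1, tag=7))
```

In this sketch every receiver polls its own node-local directory, so many processes waiting on messages do not all hit one shared central filesystem, which is the contention the abstract describes.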
