Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications
Ayesha Afzal, Georg Hager, Gerhard Wellein, Stefano Markidis

TL;DR
This paper demonstrates how data analytics and machine learning can identify and classify desynchronization patterns in large-scale MPI-parallel applications from a data set much smaller than a full MPI trace.
Contribution
It applies principal component analysis, clustering techniques, and correlation functions, and introduces a new "phase space plot," to characterize MPI program dynamics and desynchronization patterns.
Findings
Desynchronization patterns (or their absence) can be identified from data sets much smaller than a full MPI trace.
Principal component analysis and clustering reveal program dynamics.
New visualization methods aid in classifying parallel program behavior.
Abstract
This paper studies the utility of data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with a regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics.
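The kind of analysis the abstract describes can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, the layout (rows are time steps, columns are MPI processes) is an assumption, and the helper names `synthetic_mpi_times`, `pca`, and `two_means_1d` are hypothetical. The sketch applies principal component analysis to a matrix of per-process MPI time per time step and then groups processes by their loading on the first component, using a simple one-dimensional two-means split as a stand-in for the clustering step.

```python
import numpy as np


def synthetic_mpi_times(n_procs=16, n_steps=200, seed=0):
    """Hypothetical data: rows are time steps, columns are MPI processes."""
    rng = np.random.default_rng(seed)
    times = 1.0 + 0.01 * rng.standard_normal((n_steps, n_procs))
    # Assumed desynchronization: the upper half of the ranks accumulates
    # extra MPI waiting time that grows linearly over the run.
    drift = np.linspace(0.0, 0.5, n_steps)[:, None]
    times[:, n_procs // 2:] += drift
    return times


def pca(X, n_components=2):
    """PCA via SVD of the column-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)          # explained variance ratios
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], explained[:n_components]


def two_means_1d(values, n_iter=20):
    """Tiny 1-D k-means with k=2; label 0 is the cluster nearer the minimum."""
    centers = np.array([values.min(), values.max()], dtype=float)
    labels = np.zeros(values.shape[0], dtype=int)
    for _ in range(n_iter):
        dist0 = np.abs(values - centers[0])
        dist1 = np.abs(values - centers[1])
        labels = (dist0 > dist1).astype(int)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    return labels


mpi_time = synthetic_mpi_times()
scores, components, explained = pca(mpi_time)

# Processes with a large absolute loading on the first component
# are the ones that take part in the dominant pattern.
labels = two_means_1d(np.abs(components[0]))

print("explained variance ratio of first component:", explained[0])
print("process cluster labels:", labels)
```

In this synthetic setup the first component captures the drifting ranks, so the two-means split separates them from the ranks that stay in lockstep. The input is one number per process per time step, which is far less than a full MPI trace would contain.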
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Advanced Data Storage Technologies
