Parallel Performance of Molecular Dynamics Trajectory Analysis

Mahzad Khoshlessan; Ioannis Paraskevakos; Geoffrey C. Fox and; Shantenu Jha; Oliver Beckstein

arXiv:1907.00097·cs.DC·March 31, 2020

Parallel Performance of Molecular Dynamics Trajectory Analysis

Mahzad Khoshlessan, Ioannis Paraskevakos, Geoffrey C. Fox and, Shantenu Jha, Oliver Beckstein

PDF

1 Repo

TL;DR

This paper investigates parallel trajectory analysis in molecular dynamics, identifying bottlenecks like file I/O and communication, and demonstrates near-ideal scaling with Parallel HDF5, significantly speeding up analysis on supercomputers.

Contribution

It provides a detailed analysis of scaling bottlenecks in parallel MD trajectory analysis and introduces effective solutions like subfiling and Parallel HDF5 for improved performance.

Findings

01

Parallel HDF5 achieves near-ideal scaling up to 384 cores.

02

File I/O and MPI communication are primary bottlenecks.

03

Trajectory analysis times reduced by two orders of magnitude.

Abstract

The performance of biomolecular molecular dynamics simulations has steadily increased on modern high performance computing resources but acceleration of the analysis of the output trajectories has lagged behind so that analyzing simulations is becoming a bottleneck. To close this gap, we studied the performance of parallel trajectory analysis with MPI and the Python MDAnalysis library on three different XSEDE supercomputers where trajectories were read from a Lustre parallel file system. Strong scaling performance was impeded by stragglers, MPI processes that were slower than the typical process. Stragglers were less prevalent for compute-bound workloads, thus pointing to file reading as a bottleneck for scaling. However, a more complicated picture emerged in which both the computation and the data ingestion exhibited close to ideal strong scaling behavior whereas stragglers were…

Tables3

Table 1. Table 3 : Summary of measured timing quantities. Timings are collected for the specified line numbers in the code, labeled as t L n subscript 𝑡 L n t_{\text{L$n$}} where L n 𝑛 n refers to the line number in the corresponding algorithm (columns Algorithm 1 and 2 ), or are calculated in the same way for both algorithms from the specific quantities. Variables in the top part of the table refer to measurements of an individual MPI rank. Variables in the bottom part are aggregates such as averages over all ranks or the total time to solution.

Quantity	Definition		Description
	Algorithm 1	Algorithm 2
$t_{opening_trajectory}$	$t_{L2} + t_{L3}$	—^a	file opening and data structure initialization
$t_{I/O}^{frame}$	$t_{L4}$	$t_{L2}$	data reading per frame
$t_{comp}^{frame}$	$t_{L5}$	$t_{L3}$	compute per frame
$t_{all_frame}$	$t_{L4} + t_{L5} + t_{L6}$	$t_{L2} + t_{L3} + t_{L4}$	time to analyze one frame
$t_{RMSD}$	$t_{L1} + \dots + t_{L8}$	$t_{L1} + \dots + t_{L6}$	time to execute the body of the block_rmsd() function
$t_{end_loop}$	$t_{L6}$	$t_{L4}$	closing of the trajectory at the end of the loop
$t_{comm}$	$t_{L16}$	$t_{L15}$	data communication with MPI_Gather()
$N_{frames}^{total}$			total number of trajectory frames
$N$			total number of MPI ranks (processes), equals the number of trajectory blocks
$N_{b}$	$N_{frames}^{total} / N$		number of frames per block
$t_{comp}$	$\sum_{frame = 1}^{N_{b}} t_{comp}^{frame}$		total compute time for a rank (process)
$t_{I/O}$	$\sum_{frame = 1}^{N_{b}} t_{I/O}^{frame}$		total read I/O time for a rank (process)
$t_{Overhead1}$	$t_{all_frame} - t_{I/O} - t_{comp} - t_{end_loop}$		time inside block_rmsd() that was not measured explicitly
$t_{Overhead2}$	$t_{RMSD} - t_{all_frame} - t_{opening_trajectory}$		overhead for the block_rmsd() function call
$t_{N}$	$t_{RMSD} + t_{comm}$		total time to completion for a rank (process)
$\bar{t_{comp}}$	$\frac{1}{N} \sum_{rank = 1}^{N} t_{comp}$		average compute time over all ranks
$\bar{t_{I/O}}$	$\frac{1}{N} \sum_{rank = 1}^{N} t_{I/O}$		average read I/O time over all ranks
$\bar{t_{comm}}$	$\frac{1}{N} \sum_{rank = 1}^{N} t_{comm}$		average communication time over all ranks
$t_{total}$	$\max t_{N}$		total time to solution

Table 2. Table 4 : The wall-clock time spent for writing N seg subscript 𝑁 seg N_{\text{seg}} trajectory segments using N p subscript 𝑁 p N_{\text{p}} processes using MPI on SDSC Comet . One set of runs was performed for the timings. Scaling S 𝑆 S and efficiency E 𝐸 E are relative to the 1 node case (24 MPI processes).

$N_{seg}$	$N_{p}$	nodes	time (s)	$S$	$E$
24	24	1	89.9	1.0	1.0
48	48	2	46.8	1.9	0.96
72	72	3	33.7	2.7	0.89
96	96	4	25.1	3.6	0.89
144	144	6	43.7	2.1	0.34
192	192	8	13.5	6.7	0.83

Table 3. Table 6 : Change in R comp/IO subscript 𝑅 comp/IO R_{\text{comp/IO}} ratio with change in the RMSD workload X 𝑋 X . The RMSD workload was artificially increased in order to examine the effect of compute to I/O ratio on the performance. The reported compute and I/O time were measured based on the serial version using one core. The theoretical R comp/IO subscript 𝑅 comp/IO R_{\text{comp/IO}} (see text) is provided for comparison.

Workload $X$	$t_{comp}$ (s)	$t_{I/O}$ (s)	$R_{comp/IO}$
			measured	theoretical
$1 \times$	226	791	0.29
$40 \times$	8655	791	11	11
$70 \times$	15148	791	19	20
$100 \times$	21639	791	27	29

Equations16

RMSD (t) = R, t min \frac{1}{N} i = 1 \sum N [(R \cdot x_{i} (t) + t) - x_{i}^{ref}]^{2}

RMSD (t) = R, t min \frac{1}{N} i = 1 \sum N [(R \cdot x_{i} (t) + t) - x_{i}^{ref}]^{2}

S (N) = \frac{t _{total} ( 1 )}{t _{total} ( N )},

S (N) = \frac{t _{total} ( 1 )}{t _{total} ( N )},

E (N) = \frac{S ( N )}{N} .

E (N) = \frac{S ( N )}{N} .

\overline{t_{comp}} = \frac{1}{N} rank = 1 \sum N t_{comp} = \frac{1}{N} rank = 1 \sum N frame = 1 \sum N_{b} t_{comp}^{frame},

\overline{t_{comp}} = \frac{1}{N} rank = 1 \sum N t_{comp} = \frac{1}{N} rank = 1 \sum N frame = 1 \sum N_{b} t_{comp}^{frame},

\overline{t_{I/O}} = \frac{1}{N} rank = 1 \sum N t_{I/O} = \frac{1}{N} rank = 1 \sum N frame = 1 \sum N_{b} t_{I/O}^{frame},

\overline{t_{I/O}} = \frac{1}{N} rank = 1 \sum N t_{I/O} = \frac{1}{N} rank = 1 \sum N frame = 1 \sum N_{b} t_{I/O}^{frame},

\overline{t_{comm}} = \frac{1}{N} rank = 1 \sum N t_{comm} .

\overline{t_{comm}} = \frac{1}{N} rank = 1 \sum N t_{comm} .

R_{comp/IO} = \frac{t _{comp}}{t _{I/O}} = \frac{t _{comp} / N _{frames}^{total}}{t _{I/O} / N _{frames}^{total}} = \frac{t _{comp}^{frame}}{t _{I/O}^{frame}}

R_{comp/IO} = \frac{t _{comp}}{t _{I/O}} = \frac{t _{comp} / N _{frames}^{total}}{t _{I/O} / N _{frames}^{total}} = \frac{t _{comp}^{frame}}{t _{I/O}^{frame}}

R_{comp/comm} = \frac{t _{comp}}{t _{comm}} .

R_{comp/comm} = \frac{t _{comp}}{t _{comm}} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

hpcanalytics/supplement-hpc-py-parallel-mdanalysis
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

\authormark

KHOSHLESSAN et al

\corres

*Oliver Beckstein, \orgdivDepartment of Physics, \orgnameArizona State University, \orgaddressTempe, AZ 85281, \countryUSA.

Parallel Performance of Molecular Dynamics Trajectory Analysis

Mahzad Khoshlessan

Ioannis Paraskevakos

Geoffrey C. Fox

Shantenu Jha

Oliver Beckstein

\orgdivDepartment of Physics, \orgnameArizona State University, \orgaddressTempe, AZ 85281, \countryUSA

\orgdivDepartment of Electrical & Computer Engineering, \orgnameRutgers University, \orgaddressPiscataway, NJ 08854, \countryUSA

\orgdivDigital Science Center, \orgnameIndiana University, \orgaddressBloomington, IN 47405, \countryUSA

\orgdivCenter for Biological Physics, \orgnameArizona State University, \orgaddressTempe, AZ 85281, \countryUSA

[email protected]

(Added at production; Added at production; Added at production)

Abstract

[Summary]The performance of biomolecular molecular dynamics simulations has steadily increased on modern high performance computing resources but acceleration of the analysis of the output trajectories has lagged behind so that analyzing simulations is becoming a bottleneck. To close this gap, we studied the performance of parallel trajectory analysis with MPI and the Python MDAnalysis library on three different XSEDE supercomputers where trajectories were read from a Lustre parallel file system. Strong scaling performance was impeded by stragglers, MPI processes that were slower than the typical process. Stragglers were less prevalent for compute-bound workloads, thus pointing to file reading as a bottleneck for scaling. However, a more complicated picture emerged in which both the computation and the data ingestion exhibited close to ideal strong scaling behavior whereas stragglers were primarily caused by either large MPI communication costs or long times to open the single shared trajectory file. We improved overall strong scaling performance by either subfiling (splitting the trajectory into separate files) or MPI-IO with Parallel HDF5 trajectory files. The parallel HDF5 approach resulted in near ideal strong scaling on up to 384 cores (16 nodes), thus reducing trajectory analysis times by two orders of magnitude compared to the serial approach.

keywords:

Python, MPI, HPC, MDAnalysis, MPI I/O, HDF5, Straggler, Molecular Dynamics, Big Data, Trajectory Analysis

††articletype: Research Article

1 Introduction

Molecular dynamics (MD) simulations are a powerful method to generate new insights into the function of biomolecules 1, 2, 3, 4, 5. These simulations produce trajectories—time series of atomic coordinates—that now routinely include millions of time steps and can measure Terabytes in size. These trajectories need to be analyzed using statistical mechanics approaches 6, 7 but because of the increasing size of the data, trajectory analysis is becoming a bottleneck in typical biomolecular simulation scientific workflows 8. Many data analysis tools and libraries have been developed to extract the desired information from the output trajectories from MD simulations 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 but few can efficiently use modern High Performance Computing (HPC) resources to accelerate the analysis stage. MD trajectory analysis primarily requires reading of data from the file system; the processed output data are typically negligible in size compared to the input data and therefore we exclusively investigate the reading aspects of trajectory I/O (i.e., the “I”). We focus on the MDAnalysis package 18, 17, which is an open-source object-oriented Python library for structural and temporal analysis of MD simulation trajectories and individual protein structures. Although MDAnalysis accelerates selected algorithms with OpenMP, it is not clear how to best use it for scaling up analysis on multi-node supercomputers. Here we discuss the challenges and lessons-learned for making parallel analysis on HPC resources feasible with MDAnalysis, which should also be broadly applicable to other general purpose trajectory analysis libraries.

Previously, we had used a parallel split-apply-combine approach 23 to study the performance of the commonly performed “RMSD fitting” analysis problem 24, 25, 26, which calculates the minimal root mean squared distance (RMSD) of the positions of a subset of atoms to a reference conformation under optimization of rigid body translations and rotations 27, 28, 7. We had investigated two parallel implementations, one using Dask 29 and one using the message passing interface (MPI) with mpi4py 30, 31. For both Dask and MPI, we had previously only been able to obtain good strong scaling performance within a single node. Beyond a single node performance had dropped due to straggler tasks, a subset of tasks that had performed abnormally slower than the typical task execution times; the total execution time had become dominated by stragglers and overall performance had decreased. Stragglers are a well-known challenge to improving performance on HPC resources 32 but there has been little discussion of their impact in the biomolecular simulation community.

In the present study, we analyzed the MPI case in more detail to better understand the origin of stragglers with the goal to find parallelization approaches to speed up parallel post-processing of MD trajectories in the MDAnalysis library. We especially wanted to make efficient use of the resources provided by current supercomputers such as multiple nodes with hundreds of CPU cores and a Lustre parallel file system.

As in our previous study24 we selected the commonly used RMSD algorithm implemented in MDAnalysis as a typical use case. We employed the single program multiple data (SPMD) paradigm to parallelize this algorithm on three different HPC resources (XSEDE’s SDSC Comet, LSU SuperMic, and PSC Bridges 33). With SPMD, each process executes essentially the same operations on different parts of the data. The three clusters differed in their architecture but all used Lustre as their parallel file system. We used Python (https://www.python.org/), a machine-independent, byte-code interpreted, object-oriented programming language, which is well-established in the biomolecular simulation community with good support for parallel programming for HPC 30, 34. We found that communication and reading I/O were the two main scalability bottlenecks, with some indication that read I/O might have been interfering with the communications. Stragglers and therefore poor scaling were a consequence of inefficient use of the parallel Lustre file system. We therefore focused on two different approaches to better leverage Lustre and mitigate I/O bottlenecks: MPI parallel I/O (MPI-IO) with the HDF5 file format and subfiling (trajectory file splitting). MPI-IO eliminated stragglers and improved the performance with near ideal scaling, $S(N)=N$ , i.e., the speed-up $S$ scaled linearly with the number $N$ of CPU cores while exhibiting a slope of one.

The paper is organized as follows: We first review stragglers and existing approaches to parallelizing MD trajectory analysis in section 2. We describe the software packages and algorithms in section 3 and the benchmarking environment in section 4. Section 5 explains how we measured performance. The main results are presented in section 6, with section 7 demonstrating reproducibility on different supercomputers. We provide general guidelines and lessons-learned in section 8 and finish with conclusions in section 9.

2 Background and Related Work

In our previous work, we found that straightforward implementation of simple parallelization with a split-apply-combine algorithm in Python failed to scale beyond a single compute node 24 because a few tasks (MPI-ranks or Dask 29 processes) took much longer than the typical task and so limited the overall performance. However, the cause for these straggler tasks remained obscure. Here, we studied the straggler problem in the context of an MPI-parallelized trajectory analysis algorithm in Python and investigated solutions to overcome it. We briefly review stragglers in section 2.1 and summarize existing approaches to parallel trajectory analysis in section 2.2.

2.1 Stragglers

Stragglers or outliers were traditionally considered in the context of MapReduce jobs that consist of multiple tasks that all have to finish for the job to succeed: A straggler was a task that took an “unusually long time to complete” 35 and therefore substantially impeded job completion. In general, any component of a parallel workflow whose runtime exceeds a typical run time (for example, 1.5 times the median runtime) can be considered a straggler 36. Stragglers are a challenge for improving performance on HPC resources 32; they are a known problem in frameworks such as MapReduce 35, 36, Spark 37, 38, 39, 40, Hadoop 35, cloud data centers 41, 32, and have a high impact on performance and energy consumption of big data systems 42. Both internal and external factors are known to contribute to stragglers. Internal factors include heterogeneous capacity of worker nodes and resource competition due to other tasks running on the same worker node. External factors include resource competition due to co-hosted applications, input data skew, remote input or output source being too slow, faulty hardware 43, 35, and node mis-configuration 35. Competition over scarce resources 36, in particular the network bandwidth, was found to lead to stragglers in writing on Lustre file systems 44. Garbage collection 37, 38, Java virtual machine (JVM) positioning to cores 37, delays introduced while the tasks move from the scheduler to execution 39, disk I/O during shuffling, Java’s just-in-time compilation 38, output skew 38, high CPU utilization, disk utilization, unhandled I/O access requests, and network package loss 32 were also among other external factors that might introduce stragglers. A wide variety of approaches have been investigated for detecting and mitigating stragglers, including tuning resource allocation and parallelism such as breaking the workload into many small tasks that are dynamically scheduled at runtime 45, slow Node-Threshold 35, speculative execution 35 and cause/resource-aware task management 36, sampling or data distribution estimation techniques, SkewTune to avoid data imbalance 46, dynamic work rebalancing 41, blocked time analysis 47, and intelligent scheduling 48.

In the present study, we analyzed MD trajectories in parallel with MPI and Python and observed large variations in the completion time of individual MPI ranks. These variations bore some similarity to the straggler tasks observed in MapReduce frameworks so we approached analyzing and eliminating them in a similar fashion by systematically looking at different components of the problem, including read I/O from the shared Lustre file system and MPI communication. Even though we specifically worked in with the MDAnalysis package, all these principles and techniques are potentially applicable to MPI-parallelized data analysis in other Python-based libraries.

2.2 Other Packages with Parallel Analysis Capabilities

Different approaches to parallelizing the analysis of MD trajectories have been proposed. HiMach 14 introduces scalable and flexible parallel Python framework to deal with massive MD trajectories, by combining and extending Google’s MapReduce and the VMD analysis tool 11. HiMach’s runtime is responsible for parallelizing and distributing Map and Reduce classes to assigned cores. HiMach uses parallel I/O for file access during map tasks and MPI_Allgather in the reduction process. HiMach, however, does not discuss parallel analysis of analysis types that cannot be implemented via MapReduce. Furthermore, HiMach is not available under an open source license, which makes it difficult to integrate community contributions and add new state-of-the-art methods.

Wu et. al. 49 present a scalable parallel framework for distributed-memory MD simulation data analysis. This work consists of an interface that allows a user to write analysis programs sequentially, and the machinery that ensures these programs execute in parallel automatically. Parallelization is performed over domains in the simulation system via domain decomposition and the introduction of ghost atoms to include appropriate nearest neighbor interactions. The HDF5 file format is used for parallel reading and writing. This work focuses on applications in the materials science and does not consider parallelization over trajectory blocks.

Zazen 50 is a novel task-assignment protocol to overcome the I/O bottleneck for many I/O bound tasks. This protocol caches a copy of simulation output files on the local disks of the compute nodes of a cluster, and uses co-located data access with computation. Zazen is implemented in a parallel disk cache system and avoids the overhead associated with querying metadata servers by reading data in parallel from local disks. This approach has also been used to improve the performance of HiMach 14. It, however, advocates a specific architecture where a parallel supercomputer, which runs the simulations, immediately pushes the trajectory data to a local analysis cluster where trajectory fragments are cached on node-local disks. In the absence of such a specific workflow, one would need to stage the trajectory across nodes, and the time for data distribution is likely to reduce any gains from the parallel analysis.

VMD 11, 51 provides molecular visualization and analysis tool through algorithmic and memory efficiency improvements, vectorization of key CPU algorithms, GPU analysis and visualization algorithms, and good parallel I/O performance on supercomputers. It is one of the most advanced programs for the visualization and analysis of MD simulations. It is, however, a large monolithic program, that is primarily driven through its built-in Tcl interface (or less frequently, through its Python interface) and thus is less well suited as a library that allows the rapid development of new algorithms or integration into workflows.

CPPTRAJ 19 offers multiple levels of parallelization (MPI and OpenMP) in a monolithic C++ implementation. It can process single, multiple, and ensembles of trajectories in parallel without changes to input scripts 52. A Python API exists in the form of the pytraj package (https://github.com/Amber-MD/pytraj), which has its own implementation of parallelization based on Python’s multiprocessing or MPI (via mpi4py 30, 31).

pyPcazip 53 is a suite of software tools written in Python for compression and analysis of MD simulation data, in particular ensembles of trajectories. pyPcazip is MPI parallelized and is specific to PCA-based investigations of MD trajectories and supports a wide variety of trajectory file formats (based on the capabilities of the underlying MDTraj package 20). pyPcazip can take one or many input MD trajectory files and convert them into a highly compressed, HDF5-based pcz format with insignificant loss of information. However, the package does not support general purpose analysis.

In situ analysis is an approach to execute analysis simultaneously with the running MD simulation so that I/O bottlenecks are mitigated 54, 55. Malakar et al. 54 studied the scalability challenges of time and space shared modes of analyzing large-scale MD simulations through a topology-aware mapping for simulation and analysis using the LAMMPS code. Similarly, Taufer and colleagues 55 presented their own framework for in situ analysis, which is based on fast on-the-fly calculation of metadata that characterizes protein substructures via maximum eigenvalues of distance matrices. These metadata are used to index trajectory frames and enable targeted analysis of trajectory subsets. Both studies provide important ideas and approaches towards moving towards online-analysis in conjunction with a running simulation but are limited in generality.

All of the above frameworks provide tools for parallel analysis of MD trajectories. Although straggler tasks are a common challenge arising in parallel analysis and are well-known in the data analysis community (see Section 2.1), there is, to our knowledge, little discussion about this problem in the biomolecular simulation community. Our own experience with a MapReduce approach in MDAnalysis 24, 26 suggested that stragglers might be a somewhat under-appreciated problem. Therefore, in the present work we wanted to better understand requirements for efficient parallel analysis of MD trajectories in MDAnalysis, but to also provide more general guidance that could benefit developments in other libraries.

3 Algorithms and Software Packages

For our investigation of parallel trajectory analysis we focus on using MPI as the standard approach to parallelization in HPC. We employ the Python language, which is widely used in the scientific community because it facilitates rapid development of small scripts and code prototypes as well as development of large applications and highly portable and reusable modules and libraries. We use the MDAnalysis library to calculate a “RMSD time series” (explained in section 3.1) as a representative use case. Further details on the software packages are provided in sections 3.2–3.3.

3.1 RMSD Calculation with MDAnalysis

Simulation data exist in trajectories in the form of time series of atom positions and sometimes velocities. Trajectories come in a plethora of different and idiosyncratic file formats. MDAnalysis 18, 17 is a widely used open source library to analyze trajectory files with an object oriented interface. The library is written in Python, with time critical code in C/C++/Cython. MDAnalysis supports most file formats of simulation packages including CHARMM 56, Gromacs 57, Amber 58, and NAMD 59 and the Protein Data Bank 60 format. At its core, it reads trajectory data in different formats and makes them available through a uniform API; specifically, coordinates are represented as standard NumPy arrays 61.

As a test case that is representative of a common task in the analysis of biomolecular simulation trajectories we calculated the time series of the minimal structural root mean square distance (RMSD) after rigid body superposition 28, 7. The RMSD is used to show the rigidity of protein domains and more generally characterizes structural changes. It is calculated as a function of time $t$ as

[TABLE]

where $\mathbf{x}_{i}(t)$ is the position of atom $i$ at time $t$ , $\mathbf{x}_{i}^{\text{ref}}$ is its position in a reference structure and the distance between these two is minimized by finding the optimum $3\times 3$ rotation matrix $\mathsf{R}$ and translation vector $\mathbf{t}$ . The optimum rigid body superposition was calculated with the QCPROT algorithm 27, 62 (implemented in Cython and available through the MDAnalysis.analysis.rms module 18).

The RMSD trajectory analysis was parallelized with a simple split-apply-combine approach 23 as outlined in the flow chart in Figure 1, with further details available in Algorithm 1. Each MPI process loads the core MDAnalysis data structure (called the Universe), which includes loading a shared “topology” file with the simulation system information and opening the shared trajectory file. Each process operates on a different block of frames (split) and iterates through them by reading the coordinates of a single frame into memory and performing the RMSD computation with them (apply). Once all frames in the block are processed, the trajectory file is closed and results are communicated to MPI rank 0 using MPI_Gather() (combine).

The RMSD was determined for a subset of protein atoms, the $N=146$ Cα atoms in the so-called “core” domain of our test system, the protein adenylate kinase 63 (see section 4.3 for further details). The time complexity for the RMSD Algorithm 1 is $\mathcal{O}(T\times N^{2})$ where $T$ is the number of frames in the trajectory and $N$ the number of particles included in the RMSD calculation 27.

3.2 MPI for Python (mpi4py)

MPI for Python (mpi4py) is a Python wrapper for the Message Passing Interface (MPI) standard and allows any Python program to employ multiple processors 30, 31. Performance degradation due to using mpi4py is not prohibitive 30, 31 and the overhead is far smaller than the overhead associated with the use of interpreted versus compiled languages 34. Overheads in mpi4py are small compared to C code if efficient raw memory buffers are used for communication 30, as used in the present study.

3.3 MPI and Parallel HDF5

HDF5 is a structured self-describing hierarchical data format which is a common mechanism for storing large quantities of numerical data in Python (http://www.hdfgroup.org/HDF5, 64). Parallel HDF5 (PHDF5) typically sits on top of a MPI-IO layer (parallel I/O with MPI) and can use MPI-IO optimizations. In PHDF5, all file access is coordinated by the MPI library; otherwise, multiple processes would compete over accessing the same file on disk. MPI-based applications launch multiple parallel instances of the Python interpreter that communicate with each other via the MPI library. Implementation is straightforward as long as the user supplies a MPI communicator and takes into account some constraints required for data consistency 64. HDF5 itself handles nearly all the details involved with coordinating file access when the shared file is opened through the mpio driver.

MPI has two flavors of operation: collective (all processes have to participate in the same order) and independent (processes can perform the operation in any order or not at all) 64. With PHDF5, modifications to file metadata must be performed collectively and although all processes perform the same task, they do not need to be synchronized 64. Other tasks and any type of data operations can be performed independently by processes. In the present study, we use independent operations.

MDAnalysis does not currently have a reader for HDF5 files or HDF5-based trajectories. In order to get a sense of the performance that is possible with a HDF5 trajectory we replaced the MDAnalysis trajectory reading in Algorithm 1 with directly accessing a HDF5 Dataset in a HDF5 file that was opened with the mpio parallel driver, as shown in Algorithm 2. The HDF5 file was generated from the original XTC trajectory file as described in Section 4.3.

4 Benchmark Environment

Our benchmark environment consisted of three different XSEDE 33 HPC resources (described in section 4.1), the software stack used (section 4.2), which had to be compiled for each resource, and the common test data set (section 4.3).

4.1 HPC Resources

The computational experiments were executed on standard compute nodes of three XSEDE 33 supercomputers, SDSC Comet, PSC Bridges, and LSU SuperMIC (Table 1). SDSC Comet is a 2 PFlop/s cluster with 2,020 compute nodes in total. It is optimized for running a large number of medium-size calculations (up to 1,024 cores) to support the most prevalent type of calculation on XSEDE resources. PSC Bridges is a 1.35 PFlop/s cluster with different types of computational nodes, including 16 GPU nodes, 8 large memory and 2 extreme memory nodes, and 752 regular nodes. It was designed to flexibly support both traditional (medium scale calculations) and non-traditional (data analytics) HPC uses. LSU SuperMIC offers 360 standard compute nodes with a peak performance of 557 TFlop/s. The parallel file system on all three machines is Lustre (http://lustre.org/) and is shared between the nodes of each cluster.

4.2 Software

Table 2 lists the tools and libraries that were required for our computational experiments. Many domain specific packages are not available in the standard software installation on supercomputers. We therefore had to compile them, which in some cases required substantial effort due to non-standard building and installation procedures or lack of good documentation. Because this is a common problem that hinders reproducibility we provide detailed version information, notes on the installation process, as well as comments on the ease of installation and the quality of the documentation in Table 2. For the MPI implementation we used Open MPI release 1.10.7 (https://www.open-mpi.org/) consistently everywhere. We used the h5py package for HDF5, which enables parallel HDF5 from Python because its dependencies, the HDF5 library itself and mpi4py, were both built against Open MPI. We used Python 2.7 because it provided maximum compatibility between packages at the time when the project was started. In principle the complete Python-dependent software stack could also be set up with Python 3.5 or higher, which is recommended because Python 2 reached end of life in January 2020. Detailed instructions to create the computing environments together with the benchmarking code can be found in the GitHub repository as described in Section 5.3. Carefully setting up the same software stack on the three different supercomputers allowed us to clearly demonstrate the reproducibility of our results and showed that our findings were not dependent on machine specifics.

4.3 Data Set

The test system contained the protein adenylate kinase with 214 amino acid residues and 3341 atoms in total 63 and the topology information (atoms types and bonds) was stored in a file in CHARMM PSF 56 format. The test trajectory was created by concatenating 600 copies of a MD trajectory with 4,187 time frames 65 (saved every 240 ps for a total simulated time of 1.004 $\mu\text{s}$ ) in CHARMM DCD 56 format and converting to Gromacs 57 XTC format trajectory, as described for the “600x” trajectory in Khoshlessan et al. 24. The trajectory had a file size of about 30 GB and contained 2,512,200 frames (corresponding to 602.4 $\mu\text{s}$ simulated time). The file size was relatively small because water molecules that were also part of the original MD simulations were stripped to reduce the original file size by a factor of about 10; such preprocessing is a common approach if one is only interested in the protein behavior. Thus, the trajectory represents a small to medium system size in the number of atoms and coordinates that have to be loaded into memory for each time frame. The XTC format is a format with lossy compression 66, 67, which also contributed to the compact file size. XTC trades lower I/O demands for higher CPU demands during decompression and therefore performed well in our previous study 24. In order to assess the performance of reading from an HDF5 file in parallel (see Section 3.3) we generated a trajectory-like HDF5 file with the data required to perform the RMSD calculation. This HDF5 file was created from the XTC file by sub-selecting the atoms for which the RMSD was calculated as detailed in Section 3.1; a Python script to perform the trajectory conversion can be found in the GitHub repository (see Section 5.3). The coordinates were stored as a two-dimensional $T\times 3N$ array where the first dimension contained $T=2,512,200$ frames and the second dimension the $3N=438$ Cartesian coordinates. Although 2,512,200 frames represents a long simulation for current standards, such trajectories will become increasingly common due to the use of special hardware 68, 69 and GPU-acceleration 70, 71, 57.

5 Methods

In the following we define the quantities and approach used for our performance measurements, with a full summary of all definitions in Table 3. We evaluated MPI performance of the parallel RMSD time series algorithm 1 and its variation (algorithm 2) by timing the total time to solution as well as the execution time for different parts of the code for individual MPI ranks with the help of the Python time.time() function.

5.1 Timing Observables

We abbreviate the timings in the following as variables $t_{\text{L$ n $}}$ where L $n$ refers to the line number in algorithm 1 (or algorithm 2, see Table 3). In the function block_rmsd(), we measured the read I/O time for ingesting the data of one trajectory frame from the file system into memory, $t_{\text{I/O}}^{\text{frame}}=t_{\text{L4}}$ , and the compute time per trajectory frame to perform the computation, ${t}_{\text{comp}}^{\text{frame}}=t_{\text{L5}}$ . The total read I/O time for a MPI rank, ${t}_{\text{I/O}}=\sum_{\text{frame}=1}^{N_{\text{b}}}{t}_{\text{I/O}}^{\text{frame}}$ , is the sum over all I/O times for all the $N_{\text{frames}}$ frames assigned to the rank; similarly, the total compute time for a MPI rank is ${t}_{\text{comp}}=\sum_{\text{frame}=1}^{N_{\text{b}}}{t}_{\text{comp}}^{\text{frame}}$ . The time delay between the end of the last iteration and exiting the for loop is $t_{\text{end\_loop}}=t_{\text{L6}}$ . The time $t_{\text{opening\_trajectory}}=t_{\text{L2}}+t_{\text{L3}}$ measures the problem setup, which includes data structure initialization and opening of topology and trajectory files. The communication time, $t_{\text{comm}}=t_{\text{L16}}$ , is the time to gather all data from all processor ranks to rank zero. The total time (for all frames) spent in block_rmsd() is $t_{\text{RMSD}}=\sum_{i=1}^{8}t_{\text{L$ i $}}$ . There are parts of the code in block_rmsd() that are not covered by the detailed timing information of ${t}_{\text{comp}}$ and ${t}_{\text{I/O}}$ . Unaccounted time is considered as overhead. We define $t_{\text{Overhead1}}$ and $t_{\text{Overhead2}}$ as the overheads of the calculations (see Table 3 for the definitions); both are expected to be negligible, which was the case in all our measurements. Finally, the total time to completion of a single MPI rank, when utilizing $N$ cores for the execution of the overall experiment, is $t_{N}$ , and as a result $t_{\text{RMSD}}+t_{\text{comm}}\equiv t_{N}$ .

5.2 Performance Parameters

We measured the total time to solution $t_{\text{total}}(N)$ with $N$ MPI processes on $N$ cores, which is effectively $t_{\text{total}}(N)\approx\max(t_{N})$ . Strong scaling was quantified by the speed-up

[TABLE]

relative to performance on a single core ( $t_{\text{total}}(1)$ ), and the efficiency

[TABLE]

Averages over ranks were calculated as

[TABLE]

and

[TABLE]

Additionally, we introduced two performance parameters that we found to be indicative of the occurrence of stragglers. We defined the ratio of compute time to read I/O time for the serial code as

[TABLE]

where the last equality shows that the ratio can also be computed from the average times per frame, $\overline{{t}_{\text{comp}}^{\text{frame}}}$ and $\overline{{t}_{\text{I/O}}^{\text{frame}}}$ . $R_{\text{comp/IO}}$ was calculated with the serial versions of our algorithms (on a single CPU core) in order to characterize the computational problem in the absence of parallelization. The ratio of compute to communication time was defined by the ratio of average total compute time to the average total communication time

[TABLE]

Because $t_{\text{comm}}$ cannot be measured for a serial code, we estimated $R_{\text{comp/comm}}$ from the rank-averages (Eqs. 4 and 6) for a given number of MPI ranks.

5.3 Data sharing

Documentation and benchmark/trajectory conversion scripts are made available in the repository https://github.com/hpcanalytics/supplement-hpc-py-parallel-mdanalysis under the GNU General Public License v3.0 (code) and the Creative Commons Attribution-ShareAlike (documentation). All materials are archived under DOI 10.5281/zenodo.3351616. These materials should enable users to recreate the computational environment on the tested XSEDE HPC resources (SDSC Comet, PSC Bridges, LSU SuperMIC), prepare data files, and run the computational experiments.

6 Computational Experiments

We had previously measured the performance of the MPI-parallelized RMSD analysis task on two different HPC resources (SDSC Comet and TACC Stampede) and had found that it only scaled well up to a single node due to high variance in the runtime of the MPI ranks, similar to the straggler phenomenon observed in big-data analytics 24. However, the ultimate cause for this high variance could not be ascertained. We therefore performed more measurements with more detailed timing information (see section 5) on SDSC Comet (described in this section) and two other supercomputers (summarized in section 7) in order to better understand the origin of the stragglers and find solutions to overcome them.

6.1 RMSD Benchmark

We measured strong scaling for the RMSD analysis task (Algorithm 1) with the 2,512,200 frame test trajectory (section 4.3) on 1 to 72 cores (one to three nodes) of SDSC Comet (Figures 2(a) and 2(b)). We observed poor strong scaling performance beyond a single node (24 cores), comparable to our previous results 24. A more detailed analysis showed that the RMSD computation, and to a lesser degree the read I/O, considered on their own, scaled well beyond 50 cores (yellow and blue lines in Figure 2(c)). But communication (sending results back to MPI rank 0 with MPI_Gather(); red line in Figure 2(c)) and the initial file opening (loading the system information into the MDAnalysis.Universe data structure from a shared topology file and opening the shared trajectory file; gray line in Figure 2(c)) started to dominate beyond 50 cores. Communication cost and initial time for opening the trajectory were distributed unevenly across MPI ranks, as shown in Figure 2(d). The ranks that took much longer to complete than the typical execution time of the other ranks were the stragglers that hurt performance.

We qualitatively denoted by straggler any MPI rank that took at least about twice as long as the group of ranks that finished fastest, roughly following the original description of a straggler as a task that took an “unusually long time to complete” 35. The fast-finishing ranks were generally clearly distinguishable in the per-rank timings such as in Figures 2(d) and 9(d). Such a qualitative definition of stragglers was sufficient for our purpose of identifying scalability bottlenecks, as shown in the following discussion.

Identification of Scalability Bottlenecks

In the example shown in Figure 2(d), 62 ranks out of 72 took about 60 s (the stragglers) whereas the remaining ranks only took about 20 s. In other instances, far fewer ranks were stragglers, as shown, for example, in Figure 9(d). The detailed breakdown of the time spent on each rank (Figure 2(d)) showed that the computation, ${t}_{\text{comp}}$ , was relatively constant across ranks. The time spent on reading data from the shared trajectory file on the Lustre file system into memory, ${t}_{\text{I/O}}$ , showed variability across different ranks. The stragglers, however, appeared to be defined by occasionally much larger communication times, $t_{\text{comm}}$ (line 16 in Algorithm 1), which were on the order of 30 s, and by larger times to initially open the trajectory (line 2 in Algorithm 1). $t_{\text{comm}}$ varied across different ranks and was barely measurable for a few of them. Although the data in Figure 2(d) represented one run and in other instances different number of ranks were stragglers, the averages over all ranks in five independent repeats (Figure 2(c)) showed that increased $t_{\text{comm}}$ were generally the reason for large variations in the run time for each rank. This initial analysis indicated that communication was a major issue that prevented good scaling beyond a single node but the problems related to file I/O also played an important role in limiting scaling performance.

Influence of Hardware

We ran the same benchmarks on multiple HPC systems that were equipped with a Lustre parallel file system [XSEDE’s PSC Bridges (Figure 9) and LSU SuperMIC (Figure 10)], and observed the occurrence of stragglers, in a manner very similar to the results described for SDSC Comet. There was no clear pattern in which certain MPI ranks would always be a straggler, and neither could we trace stragglers to specific cores or nodes. Therefore, the phenomenon of stragglers in the RMSD case was reproducible on different clusters and thus appeared to be independent from the underlying hardware.

6.2 Effect of Compute to I/O Ratio on Performance

The results in section 6.1 indicated that opening the trajectory, communication, and read I/O were important factors that appeared to correlate with stragglers. In order to better characterize the RMSD task, we computed the ratio between the time to complete the computation and the time spent on I/O per frame, $R_{\text{comp/IO}}$ (Eq. 7). The average values were $\overline{{t}_{\text{comp}}^{\text{frame}}}=0.09\ \text{ms}$ , $\overline{t_{\text{IO}}^{\text{frame}}}=0.3\ \text{ms}$ , resulting in a compute-to-I/O ratio $R_{\text{comp/IO}}\approx 0.3$ . With $R_{\text{comp/IO}}\ll 1$ , the RMSD analysis task was characterized as I/O bound.

In other studies, better scaling was observed for analysis tasks that were more compute-intensive than the RMSD calculation, such as a radial distribution function calculation 52, 26, i.e., analysis tasks that could be characterized as compute-bound. Such behavior is expected, as the contribution from the parallel part of the program that requires neither I/O nor communication is increased. From a practical point of view, it is of interest to understand the size of the effect of increasing the computational load on strong scaling, and in our case, we were interested in seeing if changes in the compute part (namely the RMSD calculation on coordinates held in memory) would have an effect on the execution time of other parts of the program. In Appendix B we set out to analyze compute bound tasks, i.e. ones with $R_{\text{comp/IO}}\gg 1$ . To assess the effect of the $R_{\text{comp/IO}}$ ratio on performance while leaving other parameters the same, we artificially increased the computational load by repeating the RMSD calculation and measured strong scaling (Figure 12). With increasing $R_{\text{comp/IO}}$ , the impact of stragglers appeared to lessen (although they did not disappear) and scaling performance improved, as expected (see Appendix B.1). Better scaling also went together with a higher ratio of compute to communication ( $R_{\text{comp/comm}}$ , Eq. 8) as shown in Appendix B.2 but ultimately I/O seemed to be the key determinant for performance.

In order to study an extreme case of a compute-bound task that would demonstrate the effect of “ideal” read I/O, we eliminated all I/O from the RMSD task by randomly generating artificial trajectory data in memory; the data had the same size as if they had been obtained from the trajectory file. The time for the data generation was excluded and no file access was necessary. Without any I/O, performance improved markedly (Figure 3), with reasonable scaling up to 72 cores (3 nodes). No stragglers were observed because overall communication time decreased dramatically by more than a factor of ten and showed less variability (Figure 3(b)) compared to the same runs with I/O (Figure 2(c)), even though exactly the same amount of data were communicated. The scaling performance suffered somewhat for more than 40 processes only because the cost of communication $t_{\text{comm}}$ became comparable to the compute time ${t}_{\text{comp}}$ and would not decrease further. Although in practice I/O cannot be avoided, this experiment demonstrated that the way how the trajectory file was accessed on the Lustre file system was at least one cause for the observed stragglers. It also showed that the communication cost for the same amount of data transfer could be dramatically higher in the presence of I/O than in its absence.

6.3 Reducing I/O Cost

In order to improve performance we needed to employ strategies to avoid the competition over file access across different ranks when the $R_{\text{comp/IO}}$ ratio was small. One obvious approach when using the Lustre parallel file system is to increase the number of stripes, i.e., the number of copies of a file that are stored on different object storage targets (OSTs). But because in our previous work we did not see scaling performance improvement with varying the stripe count 24 we decided to just use the system default, i.e., one stripe per file. Instead we experimented with two different ways for reducing the I/O cost:

splitting the trajectory file into as many segments as the number of processes (subfiling), thus using file-per-process access, and
using the HDF5 file format together with MPI-IO parallel reads instead of the XTC trajectory format.

We discuss these two approaches and their performance improvements in detail in the following sections.

6.3.1 Splitting the Trajectories (subfiling)

Subfiling is a mechanism previously used for splitting a large multi-dimensional global array to a number of smaller subarrays in which each smaller array is saved in a separate file. Subfiling reduces the file system control overhead by decreasing the number of processes concurrently accessing a shared file 72, 73. Because subfiling is known to improve performance of parallel shared-file I/O, we investigated splitting our trajectory file into as many trajectory segments as the number of processes. The trajectory file was split into $N$ segments, one for each process, with each segment having $N_{b}$ frames. This way, each process would access its own trajectory segment file without competing for file accesses.

We ran a benchmark up to 8 nodes (192 cores) and observed rather better scaling behavior with efficiencies above 0.6 (Figure 4(a) and 4(b)) with the delay time for stragglers reduced from 65 s to about 10 s for 72 processes. However, scaling was still far from ideal due to the MPI communication costs. Although the delay due to communication was much smaller than compared to parallel RMSD with shared-file I/O [compare Figure 4(d) ( $t_{\text{comm}}^{\text{Straggler}}\gg{t}_{\text{comp}}+{t}_{\text{I/O}}$ ) to Figure 2(d) ( $t_{\text{comm}}^{\text{Straggler}}\approx{t}_{\text{comp}}+{t}_{\text{I/O}}$ )], it was still delaying several processes and resulted in longer job completion times (Figure 4(d)). These delayed tasks impacted performance so that speed-up remained far from ideal (Figure 4(b)).

The subfiling approach appeared promising but it required preprocessing of trajectory files and additional storage space for the segments. We benchmarked the necessary time for splitting the trajectory in parallel using different number of MPI processes (Table 4); in general the operation scaled well, with efficiencies $>0.8$ although performance fluctuated, as seen for the case on six nodes where the efficiency dropped to $0.34$ for the run. These preprocessing times were not included in the estimates because we focused on better understanding the principal causes of stragglers and we wanted to make the results directly comparable to the results of the previous sections. Nevertheless, from an end user perspective, preprocessing of trajectories can be integrated in workflows (especially as the data in Table 4 indicated good scaling) and the preprocessing time can be quickly amortized if the trajectories are analyzed repeatedly. However, the requirement of needing as many segments as processes makes the approach somewhat inflexible as a new set of trajectory segments must be produced when a different level of parallelization is needed. Finally, the performance of parallel file systems generally suffers when too many files are processed and so there exists a limit as to how far the subfiling approach can be pushed.

Often trajectories from MD simulations on HPC machines are produced and kept in smaller files or segments that can be concatenated to form a full continuous trajectory file. These trajectory segments could be used for the subfiling approach. However, it might not be feasible to have exactly one segment per MPI rank, with all segments of equal size, which constitutes the ideal case for subfiling. MDAnalysis can create virtual trajectories from separate trajectory segment files that appear to the user as a single trajectory. In Appendix C we investigated if this so-called ChainReader functionality could benefit from the subfiling approach. We found some improvements in performance but discovered limitations in the design of the ChainReader (namely that all segment files are initially opened) that will have to be addressed before equivalent performance can be reached.

6.3.2 MPI-based Parallel HDF5

In the HPC community, parallel I/O with MPI-IO is widely used in order to address I/O limitations. We investigated MPI-based Parallel HDF5 to improve I/O scaling. We converted our XTC trajectory file into a simple custom HDF5 format so that we could test the performance of parallel I/O with the HDF5 file format. The time it took to convert our XTC file with $2,512,200$ frames into HDF5 format was about 5,400 s on a local workstation with network file system (NFS).

We ran our benchmark on up to 16 nodes (384 cores) on SDSC Comet and we observed near ideal scaling behavior (Figures 5(a) and 5(b)) with parallel efficiencies above 0.8 on up to 8 nodes (Figure 11(a)) with no straggler tasks (Figure 5(d)). The trajectory reading I/O ( ${t}_{\text{I/O}}$ ) was the dominant contribution, followed by compute ( ${t}_{\text{comp}}$ ), but because both contributions scaled well, overall scaling performance remained good, even for 384 cores. Amongst the five repeats for 12 nodes (288 cores) we observed one run with much slower I/O than typical (Figure 5(c)) but we did not further investigate this spurious case and classified it as an outlier that was excluded from the statistics. Importantly, the trajectory opening cost remained negligible (in the millisecond range) and the cost for MPI communication also remained small (below 0.1 s, even for 16 nodes). Overall, parallel MPI with HDF5 appeared to be a robust approach to obtain good scaling, even for I/O-bound tasks.

6.4 Potential Causes of Stragglers

The data indicated that an increase in the duration of both MPI communication and trajectory file access lead to large variability in the run time of individual MPI processes and ultimately poor scaling performance beyond a single node. A discussion of likely causes for stragglers begins with the observation that opening and reading a single trajectory file from multiple MPI processes appeared to be at the center of the problem.

In MDAnalysis, individual trajectory frames are loaded into memory, which ensures that even systems with tens of millions of atoms can be efficiently analyzed on resources with moderate RAM sizes. The test trajectory (file size 30 GB) had $2,512,200$ frames in total so each frame was about 0.011 MB in size. With ${t}_{\text{I/O}}\approx 0.3~{}\text{ms}$ per frame, the data were ingested at a rate of about $40$ MB/s for a single process. For 24 MPI ranks (one SDSC Comet node), the aggregated reading rate would have been about 1 GB/s, well within the available bandwidth of 56 Gb/s of the InfiniBand network interface that served the Lustre file system, but nevertheless sufficient to produce substantial constant network traffic.

Furthermore, in our study the default Lustre stripe size value was 1 MB, i.e., the amount of contiguous data stored on a single Lustre OST. Each I/O request read a single Lustre stripe but because the I/O size (0.011 MB) was smaller than the stripe size, many of these I/O requests were likely just accessing the same stripe on the same OST but nevertheless had to acquire a new reading lock for each request. The reason for this behavior is related to ensuring POSIX consistency that relates to POSIX I/O API and POSIX I/O semantics, which can have adverse effects on scalability and performance. Parallel file systems like Lustre implement sophisticated distributed locking mechanisms to ensure consistency. For example, locking mechanisms ensures that a node can not read from a file or part of a file while it might be being modified by another node. When the application I/O is not designed in a way to avoid scenarios where multiple nodes are fighting over locks for overlapping extents, Lustre can suffer from scalability limitations 74. Continuously keeping metadata updated in order to have fully consistent reads and writes (POSIX metadata management), requires writing a new value for the file’s last-accessed time (POSIX atime) every time a file is read, imposing a significant burden on parallel file system 75. Mache et al. observed that contention for the interconnect between OSTs and compute nodes due to MPI communication may lead to variable performance in I/O measurements 76. Conversely, our data suggest that single-shared-file I/O on Lustre can negatively affect MPI communication as well, even at moderate numbers (tens to hundreds) of concurrent requests, similar to the results from recent network simulations that predicted interference between MPI and I/O traffic 77. Brown et al.’s work 77 indicated that MPI traffic (inter-process communication) could be affected by increasing I/O. In particular, a few MPI processes were always delayed by one to two orders of magnitude more than the median time, which we would classify as stragglers. In summary, our observations, seen in the context of the work by Brown et al. 77, suggest that our observed stragglers with large variance in the communication step could be due to interference of MPI communications with the I/O requests on the same network. Further detailed work will be needed to test this hypothesis.

Our results clearly showed that reading a single shared file is an inefficient way to use the Lustre parallel file system; instead, parallel I/O via MPI-IO and HDF5 emerged as the most promising approach to avoid stragglers and obtain good strong scaling behavior on hundreds of cores, even for I/O bound analysis tasks.

7 Reproducibility and Performance Comparison on Different Clusters

In this section we compare the performance of the RMSD task on different HPC resources (Table 1) to examine the robustness of the methods we used for our performance study and to ensure that the results are general and independent from the specific HPC system. Scripts and instructions to set up the computational environments and reproduce our computational experiments are provided in a git repository as described in section 5.

In Appendix A we demonstrated that stragglers occured on PSC Bridges (Figure 9) and LSU SuperMIC (Figure 10) in a manner similar to the one observed on SDSC Comet (section 6.1). We performed additional comparisons for several cases discussed previously, namely (1) splitting the trajectories with blocking collective communications in MPI and (2) MPI-based parallel HDF5.

7.1 Splitting the Trajectories

Figure 6 shows the strong scaling of the RMSD task on LSU SuperMIC and SDSC Comet with subfiling. The results were comparable between the two clusters, with scaling far from ideal due to the communication cost (see section 6.3.1 and Figure 6(c)). On SuperMIC, scaling was excellent on up to two nodes (40 cores) but beyond two nodes the communication cost increased markedly for unknown reasons and thus leading to reduced scaling behavior. Overall, the scaling of the RMSD task was slighly better on LSU SuperMIC than on SDSC Comet but qualitatively similar.

7.2 MPI-based Parallel HDF5

Figure 7 shows the scaling on SDSC Comet, LSU SuperMIC, and PSC Bridges using MPI-based parallel HDF5. Performance on SDSC Comet and LSU SuperMIC was very good with near ideal linear strong scaling. The performance on PSC Bridges was sensitive to how many cores per node were used. Using all 28 cores in a node resulted in poor performance but decreasing the number of cores per node and equally distributing processes over nodes improved the scaling (Figure 7), mainly by reducing variation in the I/O times.

The main difference between the runs on PSC Bridges and SDSC Comet/LSU SuperMIC appeared to be the variance in ${t}_{\text{I/O}}$ (Figure 7(c)). The I/O time distribution was fairly small and uniform across all ranks on SDSC Comet and LSU SuperMIC (Figures 8(b) and 5(d)). However, on PSC Bridges the I/O time was on average about two and a half times larger and the I/O time distribution was also more variable across different ranks (Figure 8(a)).

7.3 Comparison of Compute and I/O Scaling Across Different Clusters

A full comparison of compute and I/O scaling across different clusters for different test cases and algorithms is shown in Table 5. For MPI-based parallel HDF5, both the compute and I/O time on Bridges were consistently larger than their corresponding values on SDSC Comet and LSU SuperMIC. For example, with one core the corresponding compute and I/O time were ${t}_{\text{comp}}=387~{}\text{s}$ , ${t}_{\text{I/O}}=1318~{}\text{s}$ versus $225~{}\text{s}$ , $423~{}\text{s}$ on SDSC Comet and $273~{}\text{s}$ , $503~{}\text{s}$ on LSU SuperMIC. This performance difference became larger with increasing core number.

Overall, the results from SDSC Comet and LSU SuperMIC were consistent with each other. Performance on PSC Bridges seemed sensitive to the exact allocation of cores on each node but nevertheless the approaches that decreased the occurrence of stragglers on SDSC Comet and LSU SuperMIC also improved performance on PSC Bridges. Thus, the findings described in the previous sections were valid for a range of different HPC clusters with Lustre file systems.

[FIGURE:]

8 Guidelines for Improving Parallel Trajectory Analysis Performance

Although the performance measurements were performed with MDAnalysis and therefore capture some details of this library such as the specific timings for file reading, we believe that the broad picture is fairly general and applies to any Python-based approach that uses MPI for parallelizing trajectory access with a split-apply-combine approach. Based on the lessons that we learned, we suggest the following guidelines to improve strong scaling performance:

Calculate the compute to I/O ratio ( $R_{\text{comp/IO}}$ , Eq. 7). As discussed in Section 6.2, for I/O bound problems the performance of the task will be affected by stragglers that delay job completion time.

Heuristic 1

For $R_{\text{comp/IO}}\gg 1$ , single-shared-file I/O can be used and one can expect reasonable scaling up to about 50 cores; for better scaling, one of the strategies under Heuristic 2 needs to be employed.

Heuristic 2

For $R_{\text{comp/IO}}\leq 1$ the task is I/O bound and single-shared-file I/O should be avoided. One might want to consider the following steps:

Heuristic 2.1

If there is access to the HDF5 format, use MPI-based Parallel HDF5 (Section 6.3.2). This approach may scale well to hundreds of cores.

Heuristic 2.2

If the trajectory file is not in HDF5 format then one can consider subfiling and split the single trajectory file into as many trajectory segments as the number of processes. This approach may scale reasonably well to less than 200 cores.

The better solution is the use of parallel I/O (Heuristic 2.1) as it makes best use of the parallel file system and scales well to hundreds of cores, regardless of $R_{\text{comp/IO}}$ . Splitting the trajectories will not scale as well as parallel I/O but it can be easily performed in parallel and trajectory conversion may be integrated into the beginning of standard workflows for MD simulations. Alternatively, trajectories may already be kept in smaller chunks if they are already produced in batches; for instance, when running simulations with Gromacs 57, the gmx mdrun -noappend option produces individual trajectory segments instead of extending an existing trajectory file.

9 Conclusions

We analyzed the strong scaling performance of a typical task when analyzing MD trajectories, the calculation of the time series of the RMSD of a protein, with the widely used Python-based MDAnalysis library. All benchmarks were performed in five replicates on three different XSEDE supercomputers to demonstrate that our results were independent from the specifics of the hardware and local environment.

The RMSD task was parallelized with MPI following the split-apply-combine approach by having each MPI process analyze a contiguous segment of the trajectory. This approach did not scale beyond a single node because straggler MPI processes exhibited large upward variations in runtime. Stragglers were primarily caused by either increased MPI communication costs or increased time to open the single shared trajectory file whereas both the computation and the ingestion of data exhibited close to ideal strong scaling behavior. Stragglers were less prevalent for compute-bound workloads (i.e., $R_{\text{comp/IO}}\gg 1$ ), suggesting that file read I/O was responsible for poor MPI communication. In particular, artificially removing all I/O substantially improved performance of the communication step and thus brought overall performance close to ideal (i.e., linear increase in speed-up with processor count with slope one), despite the fact that the amount of data to be communicated did not depend on I/O. Our results suggested that stragglers might be due to the competition between MPI messages and the Lustre file system on the shared InfiniBand interconnect, which would be consistent with other similar observations 51 and theoretical predictions by Brown et al. 77, but further work would be needed to validate this specific hypothesis. One possible interpretation of our results was that for a sufficiently large per-frame compute workload, read I/O interfered much less with communication than for an I/O bound task that almost continuously accesses the file system. This interpretation suggested that the poor scaling performance was the result of inefficient use of the Lustre file system and that we needed to improve read I/O to reduce interference.

We investigated subfiling (splitting of the trajectories into separate files, one for each MPI rank) and MPI-based parallel I/O. Subfiling improved scaling up to about 150 cores. However, subfiling, at least in the form described here, is not an ideal solution because creating and accessing many small files on a parallel file system such as Lustre can negatively impact the overall performance of the file system. Furthermore, managing a large number of files can become cumbersome and inflexible, given that the number of files determines the number of processes. When we used MPI-based parallel I/O through HDF5 together with MPI for communications we achieved nearly ideal performance up to 384 cores (16 nodes on SDSC Comet) and speed-ups of two orders of magnitude compared to the serial execution. The latter approach appears to be a promising way forward as it directly builds on very widely used technology (MPI-IO and HDF5) and echoes the experience of the wider HPC community that parallel file I/O is necessary for efficient data handling.

The biomolecular simulation community suffers from a large number of trajectory file formats with very few being based on HDF5, with the exception of the H5MD format 78 and the MDTraj HDF5 format 20. Our work suggests that HDF5-based formats should be seriously considered for MD simulations if users want to make efficient use of their HPC systems for analysis. Alternatively, enabling MPI-IO for trajectory readers in libraries such as MDAnalysis might also provide a path forward to better read performance.

We summarized our findings in a number of guidelines for improving the scaling of parallel analysis of MD trajectory data. We showed that it is feasible to run an I/O bound analysis task on HPC resources with a Lustre parallel file system and achieve good scaling behavior up to 384 CPU cores with an almost 300-fold speed-up compared to serial execution.

Future Directions

Future work might look into testing different MPI implementations, especially in combination with parallel HDF5. Choosing the best performing MPI implementation for a specific system and optimizing the parallel file system parameters might also lead to further improvements. Although our results showed qualitatively similar behavior on three different HPC resources, unexplained differences in performances remained. Deeper insights into the system-level network traffic and parallel file system access would be necessary to approach performance tuning for different HPC systems in a rational manner.

Our HDF5 results are encouraging but lack a convenient and widely available implementation. Therefore, a HDF5-based trajectory reader needs to be implemented in MDAnalysis for an existing HDF5 trajectory format. The algorithm for the analysis task could be optimized by reducing file access to the shared system topology file (and any other data common to all ranks, such as the reference coordinates in the RMSD analysis) by using MPI_Scatter and MPI_Gather to efficiently communicate the static data. In this case, only rank 0 would read these data from the file system and then scatter them to all other ranks. Each rank would then build their own MDAnalysis Universe from those data (either by gathering a serialized Universe data structure or by using Python StringIO to read the scattered text buffer containing the topology file) and their own parallel file access to an HDF5 trajectory (with the Universe.load_new() method to attach a trajectory).

In summary, the encouraging finding of this work is that by using parallel file reading (here tested with HDF5), the simple split-apply-combine single trajectory parallelization approach can work on current HPC systems up to a few hundred cores, even for I/O-bound tasks. The major advantage of the approach is its simplicity as users can directly use their serial code and apply it to blocks of a trajectory, without having to rewrite their algorithms or having to consider hybrid parallelization schemes. Although we focused on the MDAnalysis library, similar strategies are likely to be more generally applicable and useful to the wider biomolecular simulation community.

Acknowledgements

\ack

We are grateful to Sarp Oral for insightful comments on this manuscript. This work was supported by the National Science Foundation under grant numbers ACI-1443054 and ACI-1440677. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. SDSC Comet at the San Diego Supercomputer Center, LSU SuperMic at Louisiana State University, and PSC Bridges at the Pittsburgh Supercomputing Center were used under allocations TG-MCB090174 and TG-MCB130177.

Appendix A Additional Data

Figure 9 shows performance of the RMSD task on PSC Bridges.

Figure 10 shows performance of the RMSD task on LSU SuperMIC.

Figure 11 shows comparison of the parallel efficiency of the RMSD task between different test cases on SDSC Comet, PSC Bridges, and LSU SuperMIC when reading from a HDF5 file.

Appendix B Effect of Increasing the Computational Load on Scaling

Performance

We quantified the effect of increasing the computational load of the analysis task on scaling performance in order to obtain practical insights into the question under which circumstances the simple single-trajectory split-apply-combine MPI-based parallelization approach might be feasible without any other considerations such as optimization of the file read I/O. To this end, we quantified strong scaling performance as a function of the compute-to-I/O ratio $R_{\text{comp/IO}}$ (Eq. 7) and the compute-to-communication ratio $R_{\text{comp/comm}}$ (Eq. 8).

To measure the effect of an increased compute load on performance while leaving other parameters the same, we artificially increased the computational load by repeating the same RMSD calculation (line 5, algorithm 1) 40, 70 and 100 times in a loop, resulting in forty-fold (“ $40\times$ ”), seventy-fold (“ $70\times$ ”), and one hundred-fold (“ $100\times$ ”) load increases.

B.1 Effect of $R_{\text{comp/IO}}$ on Scaling Performance

For an $X$ -fold increase in workload, we expected the workload for the computation to scale with $X$ as ${t}_{\text{comp}}(X)=N_{\text{frames}}^{\text{total}}X\overline{{t}_{\text{comp}}^{\text{frame}}}$ while the read I/O workload ${t}_{\text{I/O}}(X)=N_{\text{frames}}^{\text{total}}\overline{{t}_{\text{I/O}}^{\text{frame}}}$ (number of frames times the average time to read a frame) should remain independent of $X$ . Therefore, the ratio for any $X$ should be $R_{\text{comp/IO}}(X)={t}_{\text{comp}}(X)/{t}_{\text{I/O}}(X)=XR_{\text{comp/IO}}(X=1)$ , i.e., $R_{\text{comp/IO}}$ should just linearly scale with the workload factor $X$ . The measured $R_{\text{comp/IO}}$ ratios of 11, 19, 27 for the increased computational workloads agreed with this theoretical analysis, as shown in Table 6.

We performed the experiments with increased workload to measure the effect of the $R_{\text{comp/IO}}$ ratio (Eq. 7) on performance (Figure 12). The strong scaling performance as measured by the speed-up $S(N)$ improved with increasing $R_{\text{comp/IO}}$ ratio (Figure 12(a)). The calculations consistently showed better scaling performance to larger numbers of cores for higher $R_{\text{comp/IO}}$ ratios, e.g., $N=56$ cores for the $70\times$ RMSD task. The speed-up and efficiency approached their ideal value for each processor count with increasing $R_{\text{comp/IO}}$ ratio (Figures 12(b) and 12(c)). Even for moderately compute-bound workloads, such as the $40\times$ and $70\times$ RMSD tasks, increasing the computational workload over I/O reduced the impact of stragglers even though they still contributed to large variations in timing across different ranks and thus irregular scaling.

B.2 Effect of $R_{\text{comp/comm}}$ on Scaling Performance

In Section 6.3, we improved scaling limitations due to read I/O by splitting the trajectory, but scaling remained far from ideal because of increased communication costs. These results motivated an analysis in terms of the communication costs. In addition to the compute to I/O ratio $R_{\text{comp/IO}}$ discussed in Section B.1 we defined another performance parameter called the compute to communication ratio $R_{\text{comp/comm}}$ (Eq. 8).

We analyzed the data for variable workloads (see Section B.1) in terms of the $R_{\text{comp/comm}}$ ratio. The performance clearly depended on the $R_{\text{comp/comm}}$ ratio (Figure 13). Performance improved with increasing $R_{\text{comp/comm}}$ ratios (Figures 13(a) and 12(a)) even if the communication time was larger (Figure 13(b)). Although we still observed stragglers due to communication at larger $R_{\text{comp/comm}}$ ratios ( $70\times$ RMSD and $100\times$ RMSD), their effect on performance remained modest because the overall performance was dominated by the compute load. Evidently, as long as overall performance is dominated by a component such as compute that scales well, then performance problems with components such as communication will be masked and overall acceptable performance can still be achieved (Figures 12(a) and 13(a)).

Communication was usually not problematic within one node because of the shared memory environment. For less than 24 processes, i.e., a single compute node on SDSC Comet, the scaling was good and $R_{\text{comp/comm}}\gg 1$ for all RMSD loads (Figures 12(a) and 13(a)). However, beyond a single compute node ( $>$ 24 cores), scaling appeared to improve with increasing $R_{\text{comp/comm}}$ ratio while the communication overhead decreased in importance (Figures 12(a) and 13(a)).

Appendix C Performance of the ChainReader for Split Trajectories

In section 6.3.1 we showed how subfiling (splitting the trajectories) could help overcome I/O limitations and improve scaling. However, the number of trajectories may not necessarily be equal to the number of processes. For example, trajectories from MD simulations on supercomputers are often kept in small segments in individual files that need to be concatenated later to form a trajectory that can be analyzed with common tools. Such segments might be useful for subfiling but making sure that the number of processes is equal to the number of trajectory files will not always be feasible. MDAnalysis can transparently represent multiple trajectories as one virtual trajectory using the ChainReader. This feature is convenient for serial analysis when trajectories are maintained as segments. In the current implementation of ChainReader, each process opens all the trajectory segment files but I/O will only happen from a specific block of the trajectory specific to that process only.

We wanted to test if the ChainReader would benefit from the gains measured for the subfiling approach. Specifically, we measured if the MPI-parallelized RMSD task (with $N_{\text{p}}$ ranks) would benefit if the trajectory was split into $N_{\text{seg}}=N_{\text{p}}$ trajectory segments, corresponding to an ideal scenario.

In order to perform our experiments we had to work around an issue with the XTC format reader in MDAnalysis that was related to the XTC random-access functionality that the MDAnalysis.coordinates.XTC.XTCReader class provides: The Gromacs XTC format 66, 67 is a lossy-compression, XDR-based file format that was never designed for random access and the compressed format itself does not support fast random seeking. The XTCReader stores persistent offsets for trajectory frames to disk 18 in order to enable efficient access to random frames. These offsets will be generated automatically the first time a trajectory is opened and the offsets are stored in hidden *.xtc_offsets.npz files. The advantage of these persistent offset files is that after opening the trajectory for the first time, opening the same file will be very fast, and random access is immediately available. However, stored offsets can get out of sync with the trajectory they refer to. To prevent the use of stale offset data, trajectory file data (number of atoms, size of the file and last modification time) are also stored for validation. If any of these parameters change the offsets are recalculated. If the XTC changes but the offset file is not updated then the offset file can be detected as invalid. With ChainReader in parallel, each process opens all the trajectories because each process builds its own MDAnalysis.Universe data structure. If an invalid offset file is detected (perhaps because of wrong file modification timestamps across nodes), several processes might want to recalculate these parameters and rebuild the offset file, which can lead to a race condition. In order to avoid the race condition, we removed this check from MDAnalysis for the purpose of the measurements presented here, but this comes at the price of not checking the validity of the offset files at all; future versions of MDAnalysis may lift this limitation. We obtained the results for the ChainReader with this modified version of MDAnalysis that eliminates the race condition by assuming that XTC index files are always valid.

Figure 14 shows the results for performance of the ChainReader for the RMSD task. Strong scaling plateaued for more than 92 cores and performance was worse than what was achieved when each MPI process was assigned its own trajectory segment as described in Section 6.3.1 and shown for comparison as dotted lines in Figures 14(a) and 14(b). The strong scaling performance did not suffer because of the computation and the read I/O because both ${t}_{\text{comp}}$ and ${t}_{\text{I/O}}$ showed excellent strong scaling up to 196 cores (Figure 14(c)). Instead the time for ending the for loop $t_{\text{end\_loop}}$ , which includes the time for closing the trajectory file, and opening the trajectory $t_{\text{opening\_trajectory}}$ appeared to be the scaling bottleneck. These results differed from the subfiling results (section 6.3.1) where neither $t_{\text{end\_loop}}$ nor $t_{\text{opening\_trajectory}}$ limited scaling (Figure 4(d)).

Although we did not further investigate the deeper cause for the reduced scaling performance of the ChainReader, we speculate that the primary problem is related to each MPI rank having to open all trajectory files in their ChainReader instance even though they will only read from a small subset. For $N_{\text{p}}$ ranks and $N_{\text{seg}}$ file segments, in total, $N_{p}N_{\text{seg}}$ file opening/closing operations have to be performed. Each server that is part of a Lustre file system can only handle a limited number of I/O requests (read, write, stat, open, close, etc.) per second. A large number of such requests, from one or more users and one or more jobs, can lead to contention for storage resources. For $N_{\text{p}}=N_{\text{seg}}=100$ , the Lustre file system has to perform 10,000 of these operations almost simultaneously, which might degrade performance.

These considerations indicate that the ChainReader in its current implementation limits scaling performance to less than 100 processes due to the large number of file opening operations. For better performance, the ChainReader would need to be rewritten for MPI such that the layout of the trajectory files is initially obtained by MPI rank 0 (which has to sequentially open all trajectory segments once) and then communicated to all other ranks; each rank then only opens the trajectories it needs to access, thus reducing file access to a minimum.

Bibliography78

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Borhani and Shaw 2012 Borhani DW, Shaw DE. The future of molecular dynamics simulations in drug discovery. J Comput Aided Mol Des 2012; 26(1): 15–26. doi: 10.1007/s 10822-011-9517-y . · doi ↗
2Dror et al. 2012 Dror RO, Dirks RM, Grossman JP, Xu H, Shaw DE. Biomolecular simulation: a computational microscope for molecular biology. Annu Rev Biophys 2012; 41: 429–52. doi: 10.1146/annurev-biophys-042910-155245 . · doi ↗
3Orozco 2014 Orozco M. A theoretical view of protein dynamics. Chem Soc Rev 2014; 43: 5051–5066. doi: 10.1039/C 3CS 60474 H . · doi ↗
4Perilla et al. 2015 Perilla JR, Goh BC, Cassidy CK, et al. Molecular dynamics simulations of large macromolecular complexes. Current Opinion in Structural Biology 2015; 31: 64 – 74. doi: 10.1016/j.sbi.2015.03.007 . · doi ↗
5Bottaro and Lindorff-Larsen 2018 Bottaro S, Lindorff-Larsen K. Biophysical experiments and biomolecular simulations: A perfect match? Science 2018; 361(6400): 355–360. doi: 10.1126/science.aat 4010 . · doi ↗
6Tuckerman 2010 Tuckerman ME. Statistical Mechanics: Theory and Molecular Simulation . Oxford, UK: Oxford University Press, 2010.
7Mura and Mc Anany 2014 Mura C, Mc Anany CE. An introduction to biomolecular simulations and docking. Molecular Simulation 2014; 40(10-11): 732–764. doi: 10.1080/08927022.2014.935372 . · doi ↗
8Cheatham and Roe 2015 Cheatham T, Roe D. The impact of heterogeneous computing on workflows for biomolecular simulation and analysis. Computing in Science Engineering 2015; 17(2): 30–39. doi: 10.1109/MCSE.2015.7 . · doi ↗

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Parallel Performance of Molecular Dynamics Trajectory Analysis

Abstract

keywords:

1 Introduction

2 Background and Related Work

2.1 Stragglers

2.2 Other Packages with Parallel Analysis Capabilities

3 Algorithms and Software Packages

3.1 RMSD Calculation with MDAnalysis

3.2 MPI for Python (mpi4py)

3.3 MPI and Parallel HDF5

4 Benchmark Environment

4.1 HPC Resources

4.2 Software

4.3 Data Set

5 Methods

5.1 Timing Observables

5.2 Performance Parameters

5.3 Data sharing

6 Computational Experiments

6.1 RMSD Benchmark

Identification of Scalability Bottlenecks

Influence of Hardware

6.2 Effect of Compute to I/O Ratio on Performance

6.3 Reducing I/O Cost

6.3.1 Splitting the Trajectories (subfiling)

6.3.2 MPI-based Parallel HDF5

6.4 Potential Causes of Stragglers

7 Reproducibility and Performance Comparison on Different Clusters

7.1 Splitting the Trajectories

7.2 MPI-based Parallel HDF5

7.3 Comparison of Compute and I/O Scaling Across Different Clusters

8 Guidelines for Improving Parallel Trajectory Analysis Performance

9 Conclusions

Future Directions

Acknowledgements

Appendix A Additional Data

Appendix B Effect of Increasing the Computational Load on Scaling

B.1 Effect of Rcomp/IOR_{\text{comp/IO}}Rcomp/IO​ on Scaling Performance

B.2 Effect of Rcomp/commR_{\text{comp/comm}}Rcomp/comm​ on Scaling Performance

Appendix C Performance of the ChainReader for Split Trajectories

B.1 Effect of $R_{\text{comp/IO}}$ on Scaling Performance

B.2 Effect of $R_{\text{comp/comm}}$ on Scaling Performance