Making PLUMED Fly: A Tutorial on Optimizing Performance

Daniele Rapetti; Massimiliano Bonomi; Carlo Camilloni; Giovanni Bussi; Gareth A. Tribello

PMC · DOI:10.1021/acs.jpcb.5c07562·February 24, 2026

Making PLUMED Fly: A Tutorial on Optimizing Performance

Daniele Rapetti, Massimiliano Bonomi, Carlo Camilloni, Giovanni Bussi, Gareth A. Tribello

PDF

Open Access

TL;DR

This tutorial explains how to optimize the performance of PLUMED, a tool used in molecular dynamics simulations, by using efficient computational techniques.

Contribution

A new tool for measuring code performance in PLUMED is introduced, along with optimization strategies for complex calculations.

Findings

01

Vector-based commands significantly improve performance over individual scalar operations for distance, angle, and torsion calculations.

02

Optimization techniques for atomic order parameters and secondary structure variables are demonstrated through benchmarks.

03

Algorithmic tricks for performance improvements are explained to help users apply them in their own work.

Abstract

PLUMED is an open-source software package that is widely used for analyzing and enhancing molecular dynamics simulations that works in conjunction with most available molecular dynamics softwares. While the computational cost of PLUMED calculations is typically negligible compared to the molecular dynamics code’s force evaluation, the software is increasingly being employed to determine complex descriptors that are more computationally demanding. For these applications performance optimization becomes critical. In this tutorial, we describe a recently implemented tool that can be used to reliably measure code performance. We then use this tool to generate detailed performance benchmarks that show how calculations of large-numbers of distances, angles or torsions can be optimized by using vector-based commands rather than individual scalar operations. We then present benchmarks that…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases3

DISTANCE COORDINATION MD

Figures15

Click any figure to enlarge with its caption.

Interaction of PLUMED with molecular dynamics codes. PLUMED is able to calculate functions of the atomic positions and apply forces to atoms by passing data to and from the underlying MD code.

Data communications between PLUMED actions. The left panel illustrates how the constituent actions in the first example input in this paper evaluate the bias function. The right panel shows how data is passed between actions when forces are evaluated using the chain rule. These figures were made by using the PLUMED command plumed show_graph, which outputs a Mermaid diagram.

Time per step as a function of the number of distances that are being computed. The blue, orange, light and dark green lines indicate the cost of using a single DISTANCE command to calculate the vector of distances using 1, 2, 4, and 8 OpenMP threads, respectively. Meanwhile, the dotted line shows the cost of using N separate DISTANCE commands. The top panel indicates the cost of calculating the distances only, while the bottom panel indicates the additional cost that comes if you apply a force on the computed distances and also need to calculate derivatives.

Time taken for a single PLUMED step as a function of the number of distances (left), angles (center) and torsions (right) that are being computed. Cost for just calculating these quantities (top panels). Cost for calculating and applying a force on the variables (bottom panels). Calculations were run on 1–32 OpenMP threads. The legend indicates what number of threads was used to produce each of the lines.

Time taken for a single PLUMED step as a function of the number of torsions that are being computed. The top panels show how the cost of calculating the torsions increases while the bottom panel shows how the cost of calculating the torsions and applying a force on these quantities changes. All the calculations that were used to generate the solid lines for graphs in the left column were run on 8 processors. For the light green line, all processors communicated via OpenMP, while the blue line shows the result that was obtained when communication between the 8 processors was managed using MPI. The dark green and orange lines show the results obtained when the two communication protocols are mixed. The orange line shows timings that are obtained by having four MPI processors that each run on two OpenMP threads, while the dark green line indicates the result that is obtained by having two MPI processors running on four OpenMP threads each. The orange and green dashed lines are results obtained when you run with 2 and 4 OpenMP threads, respectively. The lines on the graphs in the right column were obtained from calculations that were parallelized over 1 to 32 MPI processes.

Cell structures and linked list technique that is introduced in ref and employed within PLUMED. The cell box is divided into a set of cubes with a side length of d max. The cell each atom resides within is then determined at the start of the calculation. This reduces computational expense when we compute contact matrices because we can determine the neighbors of the blue atom by iterating over the atoms in the cell the blue atom resides in and the atoms in this cell’s immediate neighbors. The distance between the blue atom and any atom that is not in the same cell or one of its neighbors is guaranteed to be greater than d max.

Cost of a single PLUMED step as a function of the number of atoms for which symmetry functions are being evaluated. The top panel shows how the cost of calculating the symmetry functions changes, while the bottom panel shows how the cost of calculating the symmetry function and applying a force upon them changes. The various lines show the costs when the calculation is run on the numbers of OpenMP threads indicated in the legend.

Cost of a single PLUMED step as a function of the volume of the spherical region in which you are evaluating symmetry functions for the atoms. The bottom x-axis indicates the radius of the spherical region in which the symmetry functions are being evaluated, while the top x axis indicates the number of atoms for which symmetry functions are being evaluated.

Actions used to evaluate secondary structure variables.

Method via which the STRANDS_CUTOFF keyword of ANTIBETARMSD improves the performance of this action. PLUMED needs to determine if the three residue segment in the blue rectangle forms an antiparallel β sheet with each of the three residues in each of the red and green rectangles by computing R(X i , X ref). However, before computing these R(X i , X ref) values, PLUMED computes the distance between the yellow atom and each of the pink atoms. The value of R(X i , X ref) is then only computed if this distance is less than the STRANDS_CUTOFF value. Consequently, we compute two R(X i , X ref) values rather than five values as the central atom in the three-residue segments that are in red rectangles are too far from the three-residue segment in the blue rectangle to possibly form an antiparallel β sheet.

Lowering the STRANDS_CUTOFF value reduces the computational cost of the ANTIBETARMSD action. The orange line and right axis give the values we used for this cutoff. As discussed in the text, when you use a smaller STRANDS_CUTOFF you need to do fewer expensive RMSD or DRMSD calculations. The x-axis indicates how many of these DRMSD calculations are being performed for each cutoff. The blue line then shows how the time to perform these calculations changed as we increased the STRANDS_CUTOFF value. You can see that reducing this quantity to a reasonable value of 1.0 nm reduces the cost of the calculation by more than a factor of 2 as you move from needing to calculate over 5000 D/RMSD values to having to calculate less than 1000.

Cost of a single PLUMED step as a function of the number of atoms for which Q 6 parameters are being evaluated. The top panel shows how the cost of calculating the Q 6 parameters changes, while the bottom panel shows how the cost of calculating Q 6 and applying a force upon it changes. The various lines show the costs when the calculation is run on the numbers of OpenMP threads indicated in the legend.

Cost of a single PLUMED step as a function of the number of atoms for which the quantity defined in eq is being computed. The top panel shows how the cost of calculating this quantity changes, while the bottom panel shows how the cost of calculating this quantity and applying a force upon it changes. The various lines show the costs when the calculation is run on the numbers of OpenMP threads indicated in the legend.

Optimizing the evaluation local Q6 parameters (eq ) in a sphere. The lines indicate pairs of atoms that are within d max of each other. The green circles are the atoms that are within the blue region and for which we need to evaluate eq . To evaluate eq for these atoms we need to evaluate q lm values for all the atoms that are shown in green and all the atoms shown in red that are within d max of a green atom. Many of the atoms shown in red, for which we need to evaluate q lm, are not within the blue circle.

Cost of a single PLUMED step as a function of the radius of the spherical region in which the quantity defined in eq is being calculated. The bottom x-axis indicates the radius of the spherical region in which eq is being evaluated, while the numbers labeled #eq on the top x axis indicate the number of atoms that are within it. The numbers labeled #Q6 are the number of atoms for which eq must be evaluated.

Funding2

—H2020 European Research Council10.13039/100010663
—NextGenerationEU10.13039/100031478

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSpreadsheets and End-User Computing · Organizational Learning and Leadership · Operations Management Techniques

Full text

Introduction

1

PLUMED (https://www.plumed.org)[?](#ref1) is an open-source software package that can be used to analyze and enhance molecular dynamics (MD) trajectories. Rather than operating as a monolithic software package, PLUMED serves as a framework for researchers who are using and developing advanced sampling techniques to share ideas. Consequently, in addition to the code, we also maintain a repository for sharing the inputs that have been used in publications that employ PLUMED (https://www.plumed-nest.org)[?](#ref2) and a site for sharing tutorials that explain how to use its features (https://www.plumed-tutorials.org).[?](#ref3)

As shown in Figure, PLUMED is designed to be inserted into MD codes and to work alongside them. It works by receiving the atomic positions from the underlying MD code, calculating various quantities from these positions and, if necessary, adjusting the forces that are acting upon the atoms. As we will discuss in the background section of this paper, PLUMED is written in a way that makes adding functionalities to calculate new quantities from these positions straightforward. This, combined with PLUMED’s interoperability, has enabled developers to contribute their methodologies ?−? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? within a unified ecosystem while maintaining compatibility across different molecular dynamics engines. ?−? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? By establishing common interfaces and documentation standards, PLUMED has transformed how enhanced sampling methods are disseminated, validated, and adopted and effectively democratized access to cutting-edge simulation techniques and accelerated methodological innovation in the field.

Interaction of PLUMED with molecular dynamics codes. PLUMED is able to calculate functions of the atomic positions and apply forces to atoms by passing data to and from the underlying MD code.

Typically the computational cost of running PLUMED alongside an MD code is small. If PLUMED is being used to calculate a simple function of a small number of atomic positions, the cost of this calculation is negligible when compared to the cost associated with calculating the atomic forces. However, PLUMED is increasingly being used to perform more computationally expensive calculations. Given that in such calculations PLUMED can have a significant effect on performance, a tutorial paper that explains how to get the best performance from PLUMED feels timely. In the following sections we will thus explain some of the more complicated calculations that PLUMED can be used to perform and will then discuss various approaches that can be used to make the parts of these calculations that are performed in PLUMED run more quickly. We begin this survey in Section by discussing collective variables that are functions of sums of distances, angles or torsions. Section then discusses the calculation of symmetry functions and the tricks that can be used when calculating such functions in particular part of the simulation cell. We then discuss secondary structure variables in Section before finishing with a discussion of the Steinhardt order parameters that are often used in studies of crystal nucleation in Section. Sections and ?, which precede this survey, provide a brief introduction to PLUMED and an explanation of how the benchmarks that we discuss in the rest of the paper have been generated.

Prerequisites

2

To repeat the calculations described in this paper you need to configure the development version of PLUMED with the ––enable-modules=all flag. PLUMED can then be compiled and run in the usual way. Our calculations were run on CINECA’s LEONARDO GPU partition. PLUMED was compiled using gcc 12.2.0 and OpenMPI 4.1.6.

The timings for PLUMED that are reported in later figures are averages from 10 simulations. Error bars are used to indicate the standard deviation in all figures but these error bars are often smaller than the sizes of the points used to indicate the average.

Background

3

Input to PLUMED typically consists of a single file that provides instructions on what PLUMED should calculate. An example input that illustrates how we can use PLUMED to add a restraint that forces the vector connecting atoms 1 and 2 to point in a direction that is perpendicular to the 111 direction of the lab frame is shown below.

Although this input performs a calculation that users would likely not perform, it does nicely illustrate PLUMED’s philosophy. Each line in the input above defines an action and, by passing values between these actions, as illustrated in Figure for the above input, the input defines a chain of operations that should be performed on the input positions. Notice that we also run through the action list backward as illustrated in the right panel of Figure in order to evaluate the forces that our actions apply on the atomic positions. Also notice that, because the vector connecting the two atoms here is evaluated using the minimal image convention, it depends on the simulation box size. As such, when we do the backward propagation of the forces additional forces are added on the positions of the two atoms and on the cell vectors. These forces on the cell vectors contribute to the internal pressure of the system.

Data communications between PLUMED actions. The left panel illustrates how the constituent actions in the first example input in this paper evaluate the bias function. The right panel shows how data is passed between actions when forces are evaluated using the chain rule. These figures were made by using the PLUMED command plumed show_graph, which outputs a Mermaid diagram.

The input above also illustrates how we routinely use PLUMED’s CUSTOM action, which relies on the Lepton library ?,? that was developed by Peter Eastman and that we extracted from OpenMM.? This action provides users with a way of specifying arbitrary functions to be applied on the input values. In the above input one can thus see how the director of the vector connecting atoms 1 and 2 is computed from the components of the unnormalized vector and how the angle between this vector and the (111) direction is computed by calculating the arccosine of a dot product.

Each action in a PLUMED input has an associated label. For the inputs in this paper these labels are the strings that appear before the colon. As indicated in the left panel of Figure, using these labels in the input to the ARG keyword of later actions allows one to reuse quantities calculated by earlier actions. In the input above, the quantities that are passed between the actions are all scalars. However, from PLUMED 2.10 onward you can also pass vectors in the same way. The following example illustrates how this passing of vectors works in practice. To make clear the types of value being passed we write the labels for quantities that are vectors in blue and labels for quantities that are scalars in black in this as well as all the inputs that follow.

The final quantity calculated in this input file is an order parameter that has been used to study liquid crystals.? Each pair of atoms specified in the distance command of the input gives an orientation for one of the molecules in the liquid crystal. The scalar quantity cv is thus equal to one if all the molecules in the liquid crystal have the same orientation and zero if all these molecules all have wildly different orientations.

To keep input files short, we provide shortcut commands. So, for example, you can calculate cv from the above input by using the single command:

When PLUMED encounters this command it automatically generates the longer input file above and then uses it to perform the calculation. This approach has three advantages:

1.It reduces the amount of code that needs to be maintained
2.It allows us to quickly document what the FERRONEMATIC_ORDER command is computing by showing the longer version of the command that this input expands into within the PLUMED manual.
3.The long version of the command is also reported in the PLUMED log file, which facilitates the identification of problems.

As we illustrate in the remainder of this paper, the tools described above for quickly prototyping and documenting methods allow for cross fertilization of methods, ideas and approaches across different simulation communities. Furthermore, by reusing the same actions as much as possible we can provide the universal recommendations for optimizing performance that are the focus of the rest of this tutorial paper.

How to Examine PLUMED’s Performance

4

Before discussing our prescriptions for improving the performance of PLUMED, it is worth explaining how the measures of PLUMED’s performance that we have quoted in this tutorial have been generated. As we mentioned in the introduction, the computational expense associated with the calculations that PLUMED is performing is often negligible when compared with the calculation of the atomic forces. Furthermore, even when the calculations PLUMED performs have a non-negligiable contribution to the total simulation time, potential nonreproducibilities in the performance of the MD code might add noise and make PLUMED performance difficult to measure. It is thus suboptimal to run PLUMED alongside an MD code to measure its performance during the development stage. Furthermore, although one can run stand alone analyses of trajectories using PLUMED’s driver utility, we often find that the time for such calculations is dominated by reading the trajectory. Consequently, using the plumed driver command to benchmark PLUMED is also misguided.

For these reasons, in PLUMED 2.10 we introduced a command line tool called plumed benchmark for reliably measuring performance across different variants of the PLUMED library. To get the graphs of performance in this paper we have used variations on the following command when employing this tool:

This command instructs PLUMED to repeatedly perform the calculations in the input file called pl.dat for a system of 1000 atoms that are arranged in an simple cubic structure. Consequently, the same set of positions are passed to PLUMED on every step but these positions are stored in memory so there is no need to do any molecular dynamics or disk access.

plumed benchmark has a number of features that may be useful for code developers who are worried about performance. Please note, first and foremost, that you can read the atomic positions that should be passed to PLUMED from most of the available trajectory formats. Consequently, if you are developing some exotic method to examine proteins or other complex molecules you can benchmark using a structure that is more relevant to the problem at hand than a simple cubic crystal by using a command such as:

PLUMED benchmark also allows you to run with multiple versions of PLUMED in parallel as illustrated below:

When you use this command, PLUMED alternates between performing the calculations using the version of PLUMED that is in the system PATH and the version of PLUMED in /path/to/libplumedkernel.so. The alternation is implemented to minimize the impact of the computer’s load on the relative performance of the two versions that are being compared. Once the calculation is finished timings for the calculations with the two (or more) versions of the code are output to the log. The values PLUMED benchmark reports try to offset the initialization cost by not including it in the timings, and report an error in the relative performance of different PLUMED versions estimated using bootstrap.? Running such benchmark calculations to compare stable and development versions of the code is obviously useful if you are working on trying to optimize performance. However, it is important to remember that the timings output by PLUMED benchmark will not tell you how long production calculations will take. Real-world performance will depend on technicalities related to the underlying MD code calculations, such as transfer of atoms from/to the GPU, use of caches, etc. We would always recommend that users fine-tune input files in an as-realistic-as-possible scenario, which often means running a short version of your production trajectory.

Calculating Multiple Distances, Angles, and

Torsions

5

There are two ways to use PLUMED to calculate the three distances between atom 1 and atoms 2, 3, and 4. You can use three actions that each pass out a single scalar as in the input below:

Or you can use one action that passes out a 3 dimensional vector as in the input below.

Similar pairs of options are available through the ANGLE and TORSION commands if you are calculating multiple angles or torsions.

When the number of distances being computed in these inputs is small, Figure suggests that you will likely get very similar performance from these two options. However, if you are calculating a larger number of distances, the second option will provide much better performance as there are fewer virtual function calls for this second option and because the calculation of the distances in this second option can be parallelized using OpenMP and MPI. The crossovers in the top panel of Figure suggest that using the second input becomes particularly important for getting the best performance out of PLUMED once you are computing approximately 100 distances. For sizes greater than this the cost of running the calculation with multiple OpenMP threads is cheaper than running the calculation on a single thread. If you have forces acting upon the distances using the input that allows for multi threading becomes important to performance once you are computing 50 or more distances (Figure lower panel).

Time per step as a function of the number of distances that are being computed. The blue, orange, light and dark green lines indicate the cost of using a single DISTANCE command to calculate the vector of distances using 1, 2, 4, and 8 OpenMP threads, respectively. Meanwhile, the dotted line shows the cost of using N separate DISTANCE commands. The top panel indicates the cost of calculating the distances only, while the bottom panel indicates the additional cost that comes if you apply a force on the computed distances and also need to calculate derivatives.

The input that was used to generate the scaling plots in Figure is shown below:

This input tells PLUMED to calculate distances for every kth and (k + 1)th atom pair in the system. Consequently, if there are n atoms in the system n – 1 distances will be computed if we use the input above. These distances are then all added together and a restraint is applied on this sum. Similar inputs that used all (n – 2) sets containing the kth, (k + 1)th and (k + 2)th atoms were used to benchmark the ANGLE command, while the (n – 3) sets containing the kth, (k + 1)th, (k + 2)th and (k + 3)th atoms were used to benchmark the TORSION command. The results obtained from these calculations are shown in Figure.

Time taken for a single PLUMED step as a function of the number of distances (left), angles (center) and torsions (right) that are being computed. Cost for just calculating these quantities (top panels). Cost for calculating and applying a force on the variables (bottom panels). Calculations were run on 1–32 OpenMP threads. The legend indicates what number of threads was used to produce each of the lines.

The cost of these calculations increases linearly with the number of atoms as would be expected (Figure). Furthermore, having a force on these quantities roughly doubles the cost of the calculation, which, given that PLUMED recalculates all the distances/angles/torsions and their derivatives when applying forces using the chain rule, is to be expected. Most importantly, however, for the largest systems studied you can get a roughly factor 5 speed up by using 32 OpenMP threads rather than a single thread.

OpenMP Versus MPI

6

In the previous section we noted that PLUMED calculations can be parallelized using OpenMP or MPI. We focused on presenting our benchmarks with different numbers of OpenMP threads rather than different numbers of MPI processes because of the results shown in the left panel of Figure. To generate the lines in this figure we ran the TORSION benchmark that was introduced in the previous section using 8 OpenMP threads, 8 MPI processes, a pair of 4 OpenMP threads that communicate via MPI and four pairs of OpenMP threads that communicate via MPI. You get the best performance when you use pure OpenMP parallelism (solid light green line).

Time taken for a single PLUMED step as a function of the number of torsions that are being computed. The top panels show how the cost of calculating the torsions increases while the bottom panel shows how the cost of calculating the torsions and applying a force on these quantities changes. All the calculations that were used to generate the solid lines for graphs in the left column were run on 8 processors. For the light green line, all processors communicated via OpenMP, while the blue line shows the result that was obtained when communication between the 8 processors was managed using MPI. The dark green and orange lines show the results obtained when the two communication protocols are mixed. The orange line shows timings that are obtained by having four MPI processors that each run on two OpenMP threads, while the dark green line indicates the result that is obtained by having two MPI processors running on four OpenMP threads each. The orange and green dashed lines are results obtained when you run with 2 and 4 OpenMP threads, respectively. The lines on the graphs in the right column were obtained from calculations that were parallelized over 1 to 32 MPI processes.

For these relatively small calculations, the cost per step decreases when you use up to 8 MPI processors (right panel Figure). However, when 16 or 32 MPI processes are used the cost of the calculation is increased by the additional communication so using fewer MPI processes is more efficient.

We also found that calculations running over N MPI processors each of which are running M OpenMP threads are often slower than calculations that simply run on M OpenMP threads. This is certainly the case for N = 2 and M = 4 but is not the case for N = 4 and M = 2 (dashed and solid orange and green lines left panel Figure). Consequently, if your MD code is running using a combination of OpenMP and MPI it may be worth using the SERIAL flag in the PLUMED input to turn on off all MPI parallelism in PLUMED. Having completely separate instances of the PLUMED calculations on each of the MPI processes is often faster than dividing the calculations between the MPI processes and then communicating the data to all nodes.

Working with Symmetry Functions

7

When studying phenomena such as crystal nucleation and growth using symmetry functions is commonplace. ?,?−? ? These functions have the general form

[eqn]

where (x _ ij _, y _ ij _, z _ ij _) is the vector connecting atoms i and j, r _ ij _ is this vector’s modulus and σ is a continuous switching function that is one when its argument is small and 0 when its argument is large. An example input that illustrates how such a function can be calculated using PLUMED is shown below:

Notice that in inputs such as the one above we are now passing matrices between actions as well as scalars and vectors. We use red to distinguish the labels of matrices from the vectors and scalars. Further note that in the same way as we did for the FERRONEMATIC_ORDER parameter that was discussed in Section, we provide shortcuts that hide this complex input from casual users. Importantly, however, decomposing the calculation in the manner shown in the above input allows us to use the same or similar actions for many different symmetry functions. Furthermore, by optimizing these common actions we improve performance for many different symmetry function types.

The first and most important trick for optimizing this input is the use of the D_MAX parameter in the input to the switching function that is used in the CONTACT_MATRIX command. A D_MAX value can be set whenever you define a switching function in PLUMED. By setting this parameter of the switching function you are enforcing the value of the switching function to be zero for all r > d max by using the stretching and scaling function that is computed from the switching function, θ, as follows.

[eqn]

Consequently, when we evaluate σ(r _ ij _) in the expression above we are not computing the usual rational switching function

[eqn]

σ(r _ ij _) is instead evaluated by inserting the value obtained for θ(r _ ij _) from eq into eq.

The fact that σ(r _ ij _) is guaranteed to be zero for all r _ ij _ > d max ensures that we can use the cell structures and linked list strategy that is discussed in? and illustrated in Figure to optimize the calculation of the contact matrix. This strategy works by first dividing the simulation cell into cubic boxes that each have a side length of d max. The box each of the input atoms resides in is then identified. When we evaluate the ith row of the contact matrix we only evaluate element i, j if atom j is in the same box as atom i or one of the 26 boxes that are adjacent to the box that contains atom i. When D_MAX is used the calculation of the symmetry functions thus scales linearly and not quadratically with the number of atoms. Furthermore, because we are normally only interested in the structure in the first coordination sphere around the atoms, the D_MAX value can be set to a relatively small value. For the example calculations in this tutorial, which are run on an ideal simple cubic structure with a lattice parameter of 1 nm, D_MAX is set equal to 2 nm so the sum in eq runs over the 18 atoms that are 1, $[eqn]$ or $[eqn]$ nm from the central atom. The boxes used in PLUMED are thus considerably smaller than those one would be using when exploiting similar tricks for evaluating the interatomic forces in an MD code.

Cell structures and linked list technique that is introduced in ref and employed within PLUMED. The cell box is divided into a set of cubes with a side length of d max. The cell each atom resides within is then determined at the start of the calculation. This reduces computational expense when we compute contact matrices because we can determine the neighbors of the blue atom by iterating over the atoms in the cell the blue atom resides in and the atoms in this cell’s immediate neighbors. The distance between the blue atom and any atom that is not in the same cell or one of its neighbors is guaranteed to be greater than d max.

Using D_MAX in the way described above also ensures that we can use sparse matrix storage for the cmap.w, cmap.x, cmap.y, cmap.z, r and f matrices in the above input and sparse matrix algebra when applying functions to the elements of the matrix in the CUSTOM actions and when multiplying these matrices by vectors in the MATRIX_VECTOR_PRODUCT actions to further improve performance.

When using an input such as the one above you can parallelize the calculation of all actions using OpenMP and MPI. Simulations were performed to determine how the time it takes PLUMED to perform a calculation using 1–32 OpenMP threads with the input above depends on the number of atoms that are input to the CONTACT_MATRIX command (Figure). You can see that adding a force on the symmetry functions increases the cost of the calculation by roughly a factor of 3. However, for the largest systems, using OpenMP reduces the cost of the calculation of the forces by a factor of 3. Even so, the cost of this calculation, when run with the large numbers of atoms that have been used here, is likely too great for it to be used as a CV in an MD simulation. However, the implementation discussed here and the implementations of other expensive quantities discussed in this paper can be used to generate training data for a cheaper-to-evaluate neural network as discussed in.?

Cost of a single PLUMED step as a function of the number of atoms for which symmetry functions are being evaluated. The top panel shows how the cost of calculating the symmetry functions changes, while the bottom panel shows how the cost of calculating the symmetry function and applying a force upon them changes. The various lines show the costs when the calculation is run on the numbers of OpenMP threads indicated in the legend.

Evaluating Symmetry Functions in a Particular

Part of the Box Using the MASK Keyword

8

To reduce the computational expense associated with the calculation of symmetry functions some developers sometimes choose to only evaluate the values of symmetry functions for those atoms in a particular part of the box.? This approach makes particular sense if one is examining nucleation at a surface? or if one is using a restraint to prevent a seed from dissolving.? This approach would also be necessary if one were computing symmetry functions when using the methods for running at constant chemical potential discussed in.?

The problem when using such approaches is that the atoms within the region of interest, for which the symmetry function has to be evaluated, changes dynamically as the simulation progresses and atoms exchange in and out of the region of interest. We consequently need some way for dynamically indicating the set of atoms for which the symmetry functions need to be evaluated. The following example input illustrates how the MASK keyword can be used for precisely this purpose:

The general form for the order parameter that is being evaluated here is given by the following expression

[eqn]

In this expression s _ i _ is the symmetry function that is defined in eq. w(x _ i _) is then a differentiable function of the position, x _ i _, of atom i that is one if the atom is in the region of interest and zero otherwise. In the input above this w(x _ i _) function is a switching function that acts upon the distance between atom i and the origin of the lab frame. The ith element of the vector, w, is thus one if x _ i _ is within a sphere centered on the origin and zero otherwise.

The important thing to note in this input is that the vector w is passed to the CONTACT_MATRIX action through the MASK keyword even though it is not needed to calculate the contact matrix. Passing this vector in this way ensures that the CONTACT_MATRIX only calculates the ith row of the matrix if the ith element in w is nonzero. In other words, by passing the vector w through the MASK keyword we ensure that s _ i _ values are only computed for those atoms that are in the sphere of interest.

To determine the effect this trick has on computational performance we used PLUMED running on 16 OpenMP threads to calculate the average value of the symmetry function in a spherical subregion of a system of 120,000 atoms (Figure). The bottom x-axis in this figure indicates the radius of the sphere in which the symmetry function is being evaluated, while the top axis is then used indicate the number of atoms that are within a sphere of this radius. You can clearly see how the cost of the calculation is reduced as the radius of the spherical region of interest decreases and the number of atoms for which eq is being evaluated decreases.

Cost of a single PLUMED step as a function of the volume of the spherical region in which you are evaluating symmetry functions for the atoms. The bottom x-axis indicates the radius of the spherical region in which the symmetry functions are being evaluated, while the top x axis indicates the number of atoms for which symmetry functions are being evaluated.

Another Use for the Mask Keyword

9

The example provided in the previous section illustrates one application of a common approach for working with vectors and matrices whose elements are a product of a part that is cheap to evaluate and a part that is more expensive to evaluate. As illustrated above, if you are working with such objects you first evaluate the object with the computationally cheap elements and determine if any of the elements of this vector are zero. You then use the value from this cheap action as a mask that tells the more computationally expensive action that there are elements of its output that do not need to be computed.

Another place where this approach is used in PLUMED is in the implementation of the STRANDS_CUTOFF keyword in the ANTIBETARMSD and PARABETARMSD actions.? The following example is a typical input that uses this keyword:

The ANTIBETARMSD command that is used here is an example of a shortcut action. When PLUMED reads this input it converts it into the input for a set of actions that together compute the ANTIBETARMSD collective variable which is defined as follows

[eqn]

In this expression the sum runs over all the six residue segments of protein that could conceivably form an antiparallel β sheet and X _ i _ is the positions of the 30 backbone atoms in each of these residue segments. X ref is the positions of the 30 backbone atoms in an ideal antiparallel β sheet so R(X _ i _, X ref) is the root-mean-square deviation (RMSD) distance between the instantaneous configuration of the backbone atoms in the ith residue segment and the ideal structure for an antiparallel β sheet. The sum in eq thus counts how many segments of the protein resemble an antiparallel β sheet.

PLUMED assumes that every pair of three residue segments that are separated by more than six residues along the chain can form an antiparallel beta sheets (Figure). This action is thus computationally expensive because number of potential antiparallel beta sheets scales quadratically with the number of residues.

Actions used to evaluate secondary structure variables.

*Method via which the STRANDS_CUTOFF keyword of ANTIBETARMSD improves the performance of this action. PLUMED needs to determine if the three residue segment in the blue rectangle forms an antiparallel β sheet with each of the three residues in each of the red and green rectangles by computing R(X

i , X ref). However, before computing these R(X

i , X ref) values, PLUMED computes the distance between the yellow atom and each of the pink atoms. The value of R(X

i , X ref) is then only computed if this distance is less than the STRANDS_CUTOFF value. Consequently, we compute two R(X

i , X ref) values rather than five values as the central atom in the three-residue segments that are in red rectangles are too far from the three-residue segment in the blue rectangle to possibly form an antiparallel β sheet.*

The particular set of actions that are used to compute the ANTIBETARMSD collective variable function and the way the values are passed between them are shown in Figure. Notice the dashed line that connects the CUSTOM action with label ab_cut and the SECONDARY_STRUCTURE_RMSD action with label ab_rmsd. This line illustrates that the vector ab_cut is being used as a MASK in the second action. Each element in this vector is only equal to one if the distance between the C alpha atoms of the two central residues of the three-residue segments that we are comparing to an idealized antiparallel β sheet is less than a cutoff. If this distance is larger than the cutoff then the element is zero. Consequently, by using this vector as a mask on the SECONDARY_STRUCTURE_RMSD action we ensure that the expensive calculation of R(X _ i _, X ref) in eq above is only performed for a small subset of the residues in the protein, which could potentially form an antiparallel β sheet.

To understand why this works consider the atoms involved in five of the R(X _ i _, X ref) values that we would have to calculate to obtain ANTIBETARMSD without STRANDS_CUTOFF that are shown in Figure. In evaluating eq we need to consider whether the atoms in the blue rectangle form an antiparallel β sheet with each of the three-residue segments shown in the green and red rectangles. However, if we compute the distances between the yellow atom and each of the pink atoms we immediately see that the configurations formed by the residues in the blue rectangle and each of the red rectangles cannot possibly resemble an antiparallel β sheet as the central atoms in the residues are far too far apart. To compute eq we thus only need to compute the two R(X _ i _, X ref) values that involve the atoms in the blue rectangle and the atoms in the two green rectangles.

By using the STRANDS_CUTOFF keyword correctly we can improve the performance of the ANTIBETARMSD action by a factor of 2 for a small protein system with only 180 residues (Figure). We have plotted the performance of the version of this CV that was implemented in the paper where Pietrucci and Laio? first introduced this variable which used the DRMSD to measure the distances between the instantaneous and reference structure in Figure. A revised version of this CV that we implemented in PLUMED, that uses the RMSD in place of the DRMSD is also available. The RMSD version of this CV is slightly cheaper than the DRMSD version but the difference in cost is marginal.

Lowering the STRANDS_CUTOFF value reduces the computational cost of the ANTIBETARMSD action. The orange line and right axis give the values we used for this cutoff. As discussed in the text, when you use a smaller STRANDS_CUTOFF you need to do fewer expensive RMSD or DRMSD calculations. The x-axis indicates how many of these DRMSD calculations are being performed for each cutoff. The blue line then shows how the time to perform these calculations changed as we increased the STRANDS_CUTOFF value. You can see that reducing this quantity to a reasonable value of 1.0 nm reduces the cost of the calculation by more than a factor of 2 as you move from needing to calculate over 5000 D/RMSD values to having to calculate less than 1000.

Steinhardt Parameters

10

Steinhardt parameters ?,?,? are a key component of many approaches for studying nucleation in atomic systems. This approach is often claimed to be superior to the approach based on the symmetry function that was introduced earlier because it accounts for rotational invariance. These rotational invariances are accounted for by computing all (2l + 1) spherical harmonics, Y _ l _ ^ m ^, for a particular l value using

[eqn]

Notice that this equation resembles eq, which was introduced in the section on symmetry functions. In writing it we have thus used the symbols that were defined when we introduced that earlier equation.

From these (2l + 1) complex numbers one can then compute a modulus using

[eqn]

for each of the individual atoms. Alternatively, one can compute the following product of the (2l +

spherical harmonics on atoms i and j.?

[eqn]

The advantage of this second approach over the first is that a comparison of the orientations of the coordination spheres around atoms i and j is performed. In other words, the dot product that is evaluated in the expression above is only large if the same q lm(i) and q lm(j) values are large on both atom i and atom j. This second approach is thus less affected if there are a significant number of q lm(i) components with moderately large values than the first.

The example input below illustrates how we can use PLUMED to calculate the average Q 6(i) value for all the atoms in the system.

The input here is a shortcut once again. An expanded version that does something similar to this shortcut is provided in the next example input below. We ran calculations to determine how the computational expense associated with evaluating this variable changes with system size (Figure). A comparison of Figures and ? illustrates that computing Q 6 is roughly six times more expensive than computing a symmetry function.

Cost of a single PLUMED step as a function of the number of atoms for which Q 6 parameters are being evaluated. The top panel shows how the cost of calculating the Q 6 parameters changes, while the bottom panel shows how the cost of calculating Q 6 and applying a force upon it changes. The various lines show the costs when the calculation is run on the numbers of OpenMP threads indicated in the legend.

As discussed above, order parameters that use eq in place of eq usually demonstrate superior performance at distinguishing between atoms that are part of the solid and liquid phases. A typical order parameter for an atom that is computed using eq is given by

[eqn]

To compute the average value of this order parameter for all the atoms in the system using PLUMED we would use the following input.

In previous versions of PLUMED the computational expense associated with doing the calculations in the input above was much greater than what is required to compute eq. However, a comparison between Figures and ? shows that there is no longer a large difference in the cost associated with computing eqs and ?. In other words, there is no longer a large computational penalty if you compute eq instead of ?.

Cost of a single PLUMED step as a function of the number of atoms for which the quantity defined in eq is being computed. The top panel shows how the cost of calculating this quantity changes, while the bottom panel shows how the cost of calculating this quantity and applying a force upon it changes. The various lines show the costs when the calculation is run on the numbers of OpenMP threads indicated in the legend.

Computing eq was expensive in earlier versions of PLUMED because the distances that are used in the CONTACT_MATRIX action were computed twice. These calculations are still done twice if you use the LOCAL_Q6 shortcut that is provided within PLUMED to maintain the older syntax for this command. By breaking up the various actions within PLUMED and allowing one to reuse expensive-to-compute values computed in one action in other parts of the input we have also improved the performance of the code.

Now suppose that you want to compute the average value of the quantity that is defined in eq for the subset of atoms that are in a particular part of the box. We cannot use the INSPHERE action in the input to the MASK keyword for the CONTACT_MAP action with label cmap from the above input in the way that was illustrated in Section because, as illustrated in Figure, we need to evaluate q lm(i) values for atoms that are not within the region of interest in order to evaluate eq for all the atoms in the region of interest.

Optimizing the evaluation local Q6 parameters (eq ) in a sphere. The lines indicate pairs of atoms that are within d max of each other. The green circles are the atoms that are within the blue region and for which we need to evaluate eq . To evaluate eq for these atoms we need to evaluate q lm values for all the atoms that are shown in green and all the atoms shown in red that are within d max of a green atom. Many of the atoms shown in red, for which we need to evaluate q lm, are not within the blue circle.

We can still use a MASK to reduce the expense of this calculation as is illustrated by the input that is shown below:

As you have to evaluate two CONTACT_MATRIX actions in the above input there will be cases where simply computing the values of s _ i _ using eq for all the atoms in the system and then computing the average value of this quantity in the region of interest without using the MASK keyword at all is computationally cheaper than using the input above. We thus ran calculations to determine how the length of time required to perform the calculation using 16 OpenMP threads for a system of 30,000 atoms depends on the radius of the spherical region (Figure). The bottom x-axis in this figure indicates the radius of the sphere in which eq is being evaluated. The top axis then shows how many atoms we need to evaluate eq for in order to evaluate eq for the atoms in the spherical region as well as the number of atoms in the sphere. The result is as you would expect; namely, when the sphere is large the computational expense associated with calculating the two contact matrices ensures that using the input above is slower than simply calculating eq for all atoms and averaging. However, when the sphere is smaller the reduction in computational cost that is associated with evaluating eq for a smaller number of atoms easily makes up for the cost that is added by evaluating the contact matrix twice. It is thus considerably more computationally efficient to use the input above whenever the volume of interest is small.

Cost of a single PLUMED step as a function of the radius of the spherical region in which the quantity defined in eq is being calculated. The bottom x-axis indicates the radius of the spherical region in which eq is being evaluated, while the numbers labeled #eq on the top x axis indicate the number of atoms that are within it. The numbers labeled #Q6 are the number of atoms for which eq must be evaluated.

Additional Tips

10.1

Having discussed some of the most recent optimizations added to the PLUMED codebase, we here report a checklist of other performance-optimization ideas that have been implemented and, that are discussed in previous papers or online tutorials:

When using metadynamics you should employ the implementation of the bias that stores the potential on a grid. You do this by employing the GRID_MIN, GRID_MAX and GRID_BIN keywords in your METAD command as discussed in.? Using a grid in these calculations prevents the cost associated with evaluating the bias potential from increasing in the later parts of the trajectory when more Gaussian functions have been added.?
At the time of writing, some collective variables in PLUMED use a standard neighbor list rather than the cell structures and linked list strategy from ref ? A notable example is COORDINATION. Typically the keywords NL_CUTOFF and NL_STRIDE are used to turn on the neighbor list. If you want to use this feature you will need to optimize the neighbor list parameters for both speed and correctness as discussed in.?
Some biasing potentials act on collective variables that have a smooth dependence on the atomic coordinates. When using such bias potentials you can use the multiple-time-step implementation discussed in. ?,?
Some actions in PLUMED are able to modify global coordinates. Examples include WHOLEMOLECULES and FIT_TO_TEMPLATE. For these actions, PLUMED cannot track dependencies in an optimal way. This means that you should carefully choose the STRIDE parameters for these actions to have correct results and to minimize their impact on the overall performances.

Conclusion

11

In this tutorial, we have demonstrated how to perform reliable and reproducible benchmarks of PLUMED’s performance using the recently introduced plumed benchmark tool. We have used this tool to present a series of benchmarks covering a diverse set of applications, from simple scalar quantities to more complex collective variables such as symmetry functions and Steinhardt parameters. These examples illustrate how performance can be optimized by employing vector-based operations, linked-list algorithms, and appropriate parallelization strategies.

We encourage developers who contribute new functionalities to PLUMED to follow a similar benchmarking approach. Providing benchmarks alongside contributed code not only helps ensure performance portability and transparency but also facilitates meaningful comparisons between implementations across different hardware and software environments. Performing these comparisons will become increasingly important as functionality is ported from the CPU to GPUs using frameworks such as CUDA, OpenACC, ArrayFire and SYCL. We also invite developers to explore alternative implementations of the vectorized calculations discussed here, for instance using emerging numerical frameworks such as JAX,? PyTorch,? or TensorFlow,? and to report and share their benchmarking results with the community. Additional development work and careful benchmarking would likely result in further improvements for all the methods discussed in this tutorial. Our hope is that by providing sufficient detail for readers to reimplement these functionalities elsewhere and benchmark new implementations against our own, we embrace the competitive and collaborative spirit that has always driven the best scientific software developmentone that values both innovation and rigorous evaluation in equal measure.

Looking forward, we envision that benchmarking could be further integrated into PLUMED’s development workflow. Automated benchmarking pipelines could regularly assess performance across multiple PLUMED versions and hardware configurations, generating plots similar to those shown here and enabling continuous monitoring of performance evolution. Such a system would not only streamline performance testing but also strengthen PLUMED’s role as a transparent and reproducible platform for method development in molecular simulations.

Bibliography77

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Tribello G. A.Bonomi M.Branduardi D.Camilloni C.Bussi G.PLUMED 2: New feathers for an old bird Comput. Phys. Commun.201418560461310.1016/j.cpc.2013.09.018 · doi ↗
2The PLUMED consortium Promoting transparency and reproducibility in enhanced molecular simulations Nat. Methods 20191667067310.1038/s 41592-019-0506-831363226 · doi ↗ · pubmed ↗
3Tribello G. A.Bonomi M.Bussi G.Camilloni C.Armstrong B. I.Arsiccio A.Aureli S.Ballabio F.Bernetti M.Bonati L.PLUMED Tutorials: A collaborative, community-driven learning ecosystem J. Chem. Phys.202516209250110.1063/5.025150140035582 PMC 12317779 · doi ↗ · pubmed ↗
4Tribello G. A.Giberti F.Sosso G. C.Salvalaglio M.Parrinello M.Analyzing and Driving Cluster Formation in Atomistic Simulations J. Chem. Theory Comput.2017131317132710.1021/acs.jctc.6b 0107328121147 · doi ↗ · pubmed ↗
5Bonati L.Trizio E.Rizzi A.Parrinello M.A unified framework for machine learning collective variables for enhanced sampling simulations: mlcolvar J. Chem. Phys.202315901480110.1063/5.015634337409767 · doi ↗ · pubmed ↗
6Klug J.Triguero C.Del Pópolo M. G.Tribello G. A.Using Intrinsic Surfaces To Calculate the Free-Energy Change When Nanoparticles Adsorb on Membranes J. Phys. Chem. B 20181226417642210.1021/acs.jpcb.8b 0366129851486 · doi ↗ · pubmed ↗
7Baldi E.Ceriotti M.Tribello G. A.Extracting the Interfacial Free Energy and Anisotropy from a Smooth Fluctuating Dividing Surface J. Phys.:Condens. Matter 20172944500110.1088/1361-648X/aa 893d 28853711 · doi ↗ · pubmed ↗
8Cheng B.Ceriotti M.Tribello G. A.Classical nucleation theory predicts the shape of the nucleus in homogeneous solidification J. Chem. Phys.202015204410310.1063/1.513446132007057 · doi ↗ · pubmed ↗