Accurate Sampling with Noisy Forces from Approximate Computing

Varadarajan Rengaraj; Michael Lass; Christian Plessl; Thomas D.; K\"uhne

arXiv:1907.08497·physics.comp-ph·April 29, 2020

Accurate Sampling with Noisy Forces from Approximate Computing

Varadarajan Rengaraj, Michael Lass, Christian Plessl, Thomas D., K\"uhne

PDF

TL;DR

This paper introduces an algorithmic approach within approximate computing to accurately sample noisy forces in atomistic simulations, compensating for low-precision errors to obtain exact expectation values.

Contribution

It proposes a novel method that rigorously corrects numerical inaccuracies from low-precision hardware in atomistic simulations using a modified Langevin equation.

Findings

01

Effective compensation for low-precision errors demonstrated

02

Exact expectation values achieved despite noisy forces

03

Applicable to high-performance, energy-efficient computing hardware

Abstract

In scientific computing, the acceleration of atomistic computer simulations by means of custom hardware is finding ever growing application. A major limitation, however, is that the high efficiency in terms of performance and low power consumption entails the massive usage of low-precision computing units. Here, based on the approximate computing paradigm, we present an algorithmic method to rigorously compensate for numerical inaccuracies due to low-accuracy arithmetic operations, yet still obtaining exact expectation values using a properly modified Langevin-type equation.

Figures5

Click any figure to enlarge with its caption.

Figure 4

Tables2

Table 1. Table 1: Bitwidth of common floating-point formats

Type	sign	exponent	mantissa
IEEE 754 Quadruple-precision	1	15	112
IEEE 754 Double-precision	1	11	52
IEEE 754 Single-precision	1	8	23
IEEE 754 Half-precision	1	5	10
Bfloat16 (truncated IEEE single-precision)	1	8	7

Table 2. Table 2: Values for γ N f i x superscript subscript 𝛾 𝑁 𝑓 𝑖 𝑥 \gamma_{N}^{fix} and γ N f l o a t superscript subscript 𝛾 𝑁 𝑓 𝑙 𝑜 𝑎 𝑡 \gamma_{N}^{float} as a function of β 𝛽 \beta .

$β$	$γ_{N}^{f i x}$	$γ_{N}^{f l o a t}$
0		0.00025
1	0.0004	0.000005
2	0.000009	0.000005
3	0.0000009

Equations22

F_{I}^{N} = F_{I}^{x} F_{I}^{y} F_{I}^{z} + c_{1} \times 1 0^{- β} c_{2} \times 1 0^{- β} c_{3} \times 1 0^{- β},

F_{I}^{N} = F_{I}^{x} F_{I}^{y} F_{I}^{z} + c_{1} \times 1 0^{- β} c_{2} \times 1 0^{- β} c_{3} \times 1 0^{- β},

F_{I}^{N} = F_{I}^{x} \times 1 0^{- α_{1}} F_{I}^{y} \times 1 0^{- α_{2}} F_{I}^{z} \times 1 0^{- α_{3}} + c_{1} \times 1 0^{- (α_{1} + β)} c_{2} \times 1 0^{- (α_{2} + β)} c_{3} \times 1 0^{- (α_{3} + β)} .

F_{I}^{N} = F_{I}^{x} \times 1 0^{- α_{1}} F_{I}^{y} \times 1 0^{- α_{2}} F_{I}^{z} \times 1 0^{- α_{3}} + c_{1} \times 1 0^{- (α_{1} + β)} c_{2} \times 1 0^{- (α_{2} + β)} c_{3} \times 1 0^{- (α_{3} + β)} .

F_{I}^{N} = F_{I} + Ξ_{I}^{N},

F_{I}^{N} = F_{I} + Ξ_{I}^{N},

⟨ F_{I} (0) Ξ_{I}^{N} (t) ⟩ ≅ 0

⟨ F_{I} (0) Ξ_{I}^{N} (t) ⟩ ≅ 0

M_{I} \ddot{R}_{I} = F_{I} + Ξ_{I}^{N} - γ_{N} M_{I} \dot{R}_{I},

M_{I} \ddot{R}_{I} = F_{I} + Ξ_{I}^{N} - γ_{N} M_{I} \dot{R}_{I},

⟨ Ξ_{I}^{N} (0) Ξ_{I}^{N} (t) ⟩ ≅ 2 γ_{N} M_{I} k_{B} T δ (t) .

⟨ Ξ_{I}^{N} (0) Ξ_{I}^{N} (t) ⟩ ≅ 2 γ_{N} M_{I} k_{B} T δ (t) .

M_{I} \ddot{R}_{I} = F_{I}^{N} - γ_{N} M_{I} \dot{R}_{I},

M_{I} \ddot{R}_{I} = F_{I}^{N} - γ_{N} M_{I} \dot{R}_{I},

⟨ \frac{1}{2} M_{I} \dot{R}_{I} ⟩ = \frac{3}{2} k_{B} T .

⟨ \frac{1}{2} M_{I} \dot{R}_{I} ⟩ = \frac{3}{2} k_{B} T .

\overset{γ}{˙}_{N} = (2 K - n k_{B} T) / Q,

\overset{γ}{˙}_{N} = (2 K - n k_{B} T) / Q,

⟨ F_{I}^{N} (0) F_{I}^{N} (t) ⟩

⟨ F_{I}^{N} (0) F_{I}^{N} (t) ⟩

=

=

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Abstract

In scientific computing, the acceleration of atomistic computer simulations by means of custom hardware is finding ever growing application. A major limitation, however, is that the high efficiency in terms of performance and low power consumption entails the massive usage of low-precision computing units. Here, based on the approximate computing paradigm, we present an algorithmic method to rigorously compensate for numerical inaccuracies due to low-accuracy arithmetic operations, yet still obtaining exact expectation values using a properly modified Langevin-type equation.

keywords:

approximate computing, CP2K, fluctuation-dissipation theorem, FPGA, i-PI, low-precision arithmetic

\pubvolume

xx \issuenum1 \articlenumber5

\history

\TitleAccurate Sampling with Noisy Forces from Approximate Computing

\AuthorVaradarajan Rengaraj 1,3,‡, Michael Lass ${}^{3,4,{\ddagger}}\orcidC{}$ , Christian Plessl 3,4\orcidA, and Thomas D. Kühne ${}^{1,2}*$ \orcidB \AuthorNamesVaradarajan Rengaraj, Michael Lass, Christian Plessl and Thomas D. Kühne

\corresCorrespondence: [email protected] \secondnoteThese authors contributed equally to this work.

1 Introduction

Molecular dynamics (MD) is a very powerful and widely used technique to study thermodynamic equilibrium properties, as well as the real-time dynamics of complex systems made up of interacting atoms Alder and Wainwright (1957). This is done by numerically solving Newton’s equations of motion in a time-discretized fashion via computing the nuclear forces of all atoms at every time step Rahman (1964). Computing these forces by analytically differentiating the interatomic potential with respect to the nuclear coordinates is computationally rather expensive, which is particularly true for electronic structure based ab-initio MD simulations Car and Parrinello (1985); Kühne et al. (2007); Payne et al. (1992); Kühne (2014).

For a long time newly developed microchips became faster and more efficient over time due to new manufacturing processes and shrinking transistor sizes. However, this development slowly comes to an end as scaling down the structures of silicon based chips becomes more and more difficult. The focus therefore shifts towards making efficient use of the available technology. Hence, beside algorithmic developments Tuckerman et al. (1992); Snir (2004); Shan et al. (2005); Shaw (2005); Gonnet (2012, 2007); John et al. (2016); Kühne and Prodan (2018), there have been numerous custom computing efforts in this area to increase the efficiency of MD simulations by means of hardware acceleration, which we take up in this work. Examples of the latter are MD implementations on graphics processing units (GPUs) Anderson et al. (2008); Stone et al. (2010); Eastman and Pande (2010); Colberg and Höfling (2011); Brown et al. (2012); Le Grand et al. (2013); Abraham et al. (2015), field-programmable gate arrays (FPGAs) Herbordt et al. (2008a, b), and application-specific integrated circuits (ASICs) Shaw et al. (2007, 2014). While the use of GPUs for scientific applications is relatively widespread Owens et al. (2008); Preis et al. (2009); Weigel (2012), the use of ASICs Brown and Christ (1988); Boyle et al. (2005); Hut and Makino (1999); Fukushige et al. (1999) and FPGAs is less common Belletti et al. (2009); Baity-Jesi et al. (2014); Meyer et al. (2012); Giefers et al. (2014); Kenter et al. (2017, 2018), but gained attention over the last years. In general, to maximize the computational power for a given silicon area, or equivalently minimize the power-consumption per arithmetic operation, more and more computing units are replaced with lower-precision units. This trend is mostly driven by market considerations of the gaming and artificial intelligence industries, which are the target customers of hardware accelerators and naturally do not absolutely rely on full computing accuracy.

In the approach presented in this paper, we mimic in software how it is possible to make effective use of low-accuracy special-purpose hardware for general-purpose scientific computing by leveraging the approximate computing (AC) paradigm Klavík et al. (2014); Plessl et al. (2015). The general research goal of AC is to devise and explore ingenious techniques to relax the exactness of a calculation to facilitate the design of more powerful and/or more efficient computer systems. However, in scientific computing, where the exactness of all computed results is of paramount importance, attenuating accuracy requirements is not an option. Yet, assuming that the inaccuracies within the nuclear forces due to the usage of low-precision arithmetic operations can be approximately considered as white, we will demonstrate that it is nevertheless possible to rigorously compensate for such numerical errors and still obtain exact expectation values, as obtained by ensemble averages of a properly modified Langevin equation.

The remainder of the paper is organized as follows. In Section 2 we revisit the basic principles of AC before introducing our modified Langevin equation in Section 3. Thereafter, in Section 4, we describe the computational details of our computational experiments. Our results are presented and discussed in Section 5 before concluding the paper in Section 6.

2 Approximate Computing

A basic method of approximation and a key requirement for efficient use of processing hardware is the use of adequate data widths in computationally intensive kernels. While in many scientific applications the use of double-precision floating-point is most common, this precision is not always required. For example, iterative methods can exhibit resilience against low precision arithmetic as has been shown for the computation of inverse matrix roots Lass et al. (2018) and for solving systems of linear equations Klavík et al. (2014); Angerer et al. (2016); Haidar et al. (2017, 2018). Mainly driven by the growing popularity of artificial neural networks Gupta et al. (2015), we can observe growing support of low-precision data types in hardware accelerators. In fact, recent GPUs targeting the data center have started supporting half-precision as well, nearly doubling the peak performance compared to single-precision and quadrupling it compared to double-precision arithmetics NVIDIA Corporation (2016). However, due to the low number of exponent bits, half-precision only provides a very limited dynamic range. In contrast, bfloat16 provides the same dynamic range as single-precision, and just reduces precision. It is currently supported by Google’s Tensor Processing Units (TPU) The Next Platform (2018) and support is announced for future Intel Xeon processors Top 500 (2018) and Intel AgileX FPGAs. A list of commonly used data types, together with the corresponding number of bits used to store the exponent and the mantissa, are shown in Table 1 beside the double-precision de facto standard.

Yet, programmable hardware such as FPGAs, as a platform for custom-built accelerator designs Strzodka and Goddeke (2006); Kenter et al. (2014, 2012), can make effective use of all of these, but also entirely custom number formats. Developers can specify the number of exponent and mantissa bits and trade off precision against the amount of memory blocks required to store values and the number of logic elements required to perform arithmetic operations on them.

In addition to floating-point formats, also fixed-point representations can be used. Here, all numbers are stored as integers of fixed size with a predefined scaling factor. Calculations are thereby performed using integer arithmetic. On CPUs and GPUs only certain models can perform integer operations with a peak performance similar to that of floating-point arithmetic, depending on the capabilities of the vector units / stream processors. Nevertheless, FPGAs typically can perform integer operations with performance similar to or even higher than that of floating-point. Due to the high flexibility of FPGAs with respect to different data formats and the possible use of entirely custom data types, we see them as the main target technology for our work. For this reason, we consider both floating-point and fixed-point arithmetic in the following.

3 Methodology

The error introduced by low-precision floating-point or fixed-point computations can in general be modeled as white noise if unbiased rounding techniques are used in all arithmetic operations. A widely employed rounding technique is round half to even, which does not introduce a systematic bias, and is used by default in IEEE 754 floating-point arithmentic Microprocessor Standards Committee of the IEEE Computer Society (2019). In the following, we assume the usage of such a rounding technique also for fixed-point arithmetic, leading to an only unbiased error within the computed interatomic forces.

To demonstrate the concept of approximate computing, we introduce white noise to the interatomic forces that are computed while running the MD simulation. In this section, we describe in detail how we introduce the noise to mimic in software the behavior that would be observed when running the MD on the actual FPGA or GPU hardware with reduced numerical precision. We classify the computational errors into two types: fixed-point errors, and floating-point errors. Assuming that $\textbf{F}_{I}$ are the exact and $\textbf{F}_{I}^{N}$ the noisy forces from a MD simulation with low precision on an FPGA for instance, fixed-point errors can by modelled by

[TABLE]

whereas floating-point errors are described by

[TABLE]

Therein, $c_{1}$ , $c_{2}$ and $c_{3}$ are random values chosen in the range [-0.5, 0.5], whereas $\text{F}_{I}^{x}$ , $\text{F}_{I}^{y}$ and $\text{F}_{I}^{x}$ are the individual force components of $\textbf{F}_{I}$ , respectively. The floating-point scaling factor is denoted as $\alpha$ and the magnitude of the applied noise by $\beta$ .

To rigorously correct the errors introduced by numerical noise we employ a modified Langevin equation. In particular, we model the force as obtained by a low-precision computation on a GPU or FPGA-based accelerator as

[TABLE]

where $\mathbf{\Xi}_{I}^{N}$ is an additive white noise for which

[TABLE]

holds. Throughout, $\langle\cdots\rangle$ denotes Boltzmann-weighted ensemble averages as obtained by the partition function $Z=\text{Tr}\exp(-E/k_{B}T)$ , where $E$ is the potential energy, $k_{B}$ the so-called Boltzmann constant, and $T$ the temperature. Given that $\mathbf{\Xi}_{I}^{N}$ is unbiased, which in our case is true by its very definition, it is nevertheless possible to accurately sample the Boltzmann distribution by means of a Langevin-type equation Krajewski and Parrinello (2006); Richters and Kühne (2014); Karhan et al. (2014), which in its general form reads as

[TABLE]

where $\dot{\textbf{R}}_{I}$ are the nuclear coordinates (the dot denotes time derivative), $M_{I}$ are the nuclear masses and $\gamma_{N}$ is a damping coefficient, which is chosen to compensate for $\mathbf{\Xi}_{I}^{N}$ . The latter, in order to guarantee an accurate canonical sampling, has to obey the fluctuation-dissipation theorem

[TABLE]

Substituting Eq. 3 into Eq. 5 results in the desired modified Langevin equation

[TABLE]

which will be used throughout the remaining of this paper. This is to say that the noise, as originating from a low-precision computation, can be thought of as the additive white noise of a damping coefficient $\gamma_{N}$ , which satisfies the fluctuation-dissipation theorem of Eq. 6. The specific value of $\gamma_{N}$ is determined in such a way so as to generate the correct average temperature, as measured by the equipartition theorem

[TABLE]

4 Computational details

To demonstrate our approach we have implemented it in the CP2K suite of programs Hutter et al. (2014); Kühne et al. (2020). More precisely, we have conducted MD simulations of liquid Silicon (Si) at 3000 K using the environment-dependent interatomic potential of Bazant et al. Bazant and Kaxiras (1996); Bazant et al. (1997). All simulations consisted of 1000 Si atoms in a 3D-periodic cubic box of length 27.155 Å. Using the algorithm of Ricci and Ciccotti Ricci and Ciccotti (2003), Eq. 5 was integrated with a discretized timestep of 1.0 fs with $\gamma_{N}=0.001\leavevmode\nobreak\$ fs*-1*.

Whereas the latter settings were used to compute our reference data, in total six different cases of fixed-point and floating-point errors were investigated by varying the exponent $\beta$ between 0 (huge noise) and 3 (tiny noise) that is, ranging from $1/1000$ of the physical force up to the same magnitude as the force. As already alluded to above, the additive white noise is compensated via Eq. 7 by continously adjusting the friction coefficient $\gamma_{N}$ using the adaptive Langevin technique of Leimkuhler and coworkers so as to satisfy the equipartition theorem of Eq. 8 Jones and Leimkuhler (2011); Mones et al. (2015); Leimkuhler et al. (2019). In this method, $\gamma_{N}$ is reinterpreted as a dynamical variable, defined by a negative feedback loop control law as in the Nosé-Hoover scheme Nosé (1984); Hoover (1985). The corresponding dynamical equation for $\gamma_{N}$ reads as

[TABLE]

where $K$ is the kinetic energy, $n$ is the number of degrees of freedom and $\mathcal{Q}=k_{B}T\tau^{2}_{NH}$ is the Nose-Hoover fictitious mass with time constant $\tau_{NH}$ . Alternatively, $\gamma_{N}$ can be estimated by integrating the autocorrelation function of the additive white noise Scheiber et al. (2018). In Table 2 the resulting values of $\gamma_{N}^{fix}$ for fixed-point and $\gamma_{N}^{float}$ for floating-point errors are reported as a function of $\beta$ .

5 Results and Discussion

As can be directly deduced from Table 2, the smaller values of $\gamma_{N}$ for a given $\beta$ immediately suggest the higher noise resilience when using floating-point as compared to fixed-point numbers.

In Figs. 1 and 2, the Si-Si partial pair-correlation function $g(r)$ , which describes how the particle-density varies as a function of distance from a reference particle (atoms, molecules, colloids, etc.), as computed using an optimal scheme for orthorombic systems Röhrig and Kühne (2013), is shown for different values of $\beta$ . As can be seen, for both fixed-point and floating-point errors, the agreement with our reference calculation is nearly perfect up to the highest noise we have investigated. As already anticipated earlier, the usage of floating-point errors is not only able to tolerate higher noise levels, but is also throughout more accurate.

To verify that the sampling is indeed canonical, in Fig. 3 the actual kinetic energy distribution as obtained by our simulations using noisy forces is depicted and compared to the analytic Maxwell distribution. It is evident that if sampled long enough, not only the mean value, but also the distribution tails are in excellent agreement with the exact Maxwellian kinetic energy distribution, which demonstrates that we are indeed performing a canonical sampling.

To further assess the accuracy of the present method, we expand the autocorrelation of the noisy forces, i.e.

[TABLE]

Since the cross correlation terms between the exact force and the additive white noise is vanishing due to Eq. 4, comparing the autocorrelation of the noisy forces $\langle\textbf{F}_{I}^{N}(0)\textbf{F}_{I}^{N}(t)\rangle$ with the autocorrelation of the exact forces $\langle\textbf{F}_{I}(0)\textbf{F}_{I}(t)\rangle$ permits to assess the localization of the last term of Eq. 10. The fact that $\langle\textbf{F}_{I}^{N}(0)\textbf{F}_{I}^{N}(t)\rangle$ is essentially identical to $\langle\textbf{F}_{I}(0)\textbf{F}_{I}(t)\rangle$ , as can be seen in Fig. 4, implies that $\langle\mathbf{\Xi}_{I}^{N}(0)\mathbf{\Xi}_{I}^{N}(t)\rangle$ is very close to a $\delta$ -function as required by the fluctuation-dissipation theorem in order to ensure an accurate canonical sampling of the Boltzmann distribution. In other words, from this it follows that our initial assumption underlying Eq. 7, to model the noise due to a low-precision calculation as an additive white noise channel, is justified.

6 Conclusion

We conclude by noting that the present method has been recently implemented in the universal force engine i-PI Kapil et al. (2019), which can be generally applied to all sorts of forces affected by stochastic noise such as those computed by GPUs or other hardware accelerators Anderson et al. (2008); Stone et al. (2010); Eastman and Pande (2010); Colberg and Höfling (2011); Brown et al. (2012); Le Grand et al. (2013); Abraham et al. (2015), and potentially even quantum computing devices Steane (1999); Knill (2005); Benhelm et al. (2008); Chow et al. (2014). The possibility to apply similar ideas to N-body simulations Efstathiou et al. (1985); Hernquist et al. (1993) and to combine it with further algorithmic approximations Lass et al. (2018) is to be underlined and will be presented elsewhere.

\funding

The authors would like to thank the Paderborn Center for Parallel Computing (PC2) for computing time on OCuLUS and FPGA-based supercomputer Noctua. Funding from the Paderborn University’s research award for “Green IT” is kindly acknowledged. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No.: 716142) and from the German Research Foundation (DFG) under the project PerficienCC (grant agreement No PL 595/2-1).

\reftitle

References

Bibliography75

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Alder and Wainwright (1957) Alder, B.J.; Wainwright, T.E. Phase Transition for a Hard Sphere System. J. Chem. Phys. 1957 , 27 , 1208–1209. doi: \changeurlcolor black 10.1063/1.1743957 . · doi ↗
2Rahman (1964) Rahman, A. Correlations in the Motion of Atoms in Liquid Argon. Phys. Rev. 1964 , 136 , A 405–A 411. doi: \changeurlcolor black 10.1103/Phys Rev.136.A 405 . · doi ↗
3Car and Parrinello (1985) Car, R.; Parrinello, M. Unified approach for molecular dynamics and density-functional theory. Phys. Rev. Lett. 1985 , 55 , 2471–2474. doi: \changeurlcolor black 10.1103/Phys Rev Lett.55.2471 . · doi ↗
4Kühne et al. (2007) Kühne, T.D.; Krack, M.; Mohamed, F.R.; Parrinello, M. Efficient and accurate Car-Parrinello-like approach to Born-Oppenheimer molecular dynamics. Phys. Rev. Lett. 2007 , 98 , 066401. doi: \changeurlcolor black 10.1103/Phys Rev Lett.98.066401 . · doi ↗
5Payne et al. (1992) Payne, M.C.; Teter, M.P.; Allan, D.C.; Arias, T.A.; Joannopoulos, J.D. Iterative minimization techniques for ab initio total-energy calculations: molecular dynamics and conjugate gradients. Rev. Mod. Phys. 1992 , 64 , 1045–1097. doi: \changeurlcolor black 10.1103/Rev Mod Phys.64.1045 . · doi ↗
6Kühne (2014) Kühne, T.D. Second generation Car–Parrinello molecular dynamics. WIR Es Comput. Mol. Sci. 2014 , 4 , 391–406. doi: \changeurlcolor black 10.1002/wcms.1176 . · doi ↗
7Tuckerman et al. (1992) Tuckerman, M.E.; Berne, B.J.; Martyna, G.J. Reversible multiple time scale molecular dynamics. J. Chem. Phys. 1992 , 97 , 1990–2001. doi: \changeurlcolor black 10.1063/1.463137 . · doi ↗
8Snir (2004) Snir, M. A note on N-body computations with cutoffs. Theor. Comput. Syst. 2004 , 37 , 295–318. doi: \changeurlcolor black 10.1007/s 00224-003-1071-0 . · doi ↗