TL;DR
This paper demonstrates how accessible graphics cards can significantly accelerate Monte Carlo simulations in statistical physics, enabling live classroom demonstrations of complex phenomena like phase transitions in social dilemma models.
Contribution
It introduces a practical approach to enhance Monte Carlo simulations using GPUs, with a focus on physics education and providing accessible source code for replication.
Findings
Graphics cards can increase Monte Carlo simulation speed by orders of magnitude.
The public goods game exhibits a second-order phase transition in the directed percolation class.
GPU acceleration makes complex simulations feasible for classroom demonstrations.
Abstract
The use of computers in statistical physics is common because the sheer number of equations that describe the behavior of an entire system particle by particle often makes it impossible to solve them exactly. Monte Carlo methods form a particularly important class of numerical methods for solving problems in statistical physics. Although these methods are simple in principle, their proper use requires a good command of statistical mechanics, as well as considerable computational resources. The aim of this paper is to demonstrate how the usage of widely accessible graphics cards on personal computers can elevate the computing power in Monte Carlo simulations by orders of magnitude, thus allowing live classroom demonstration of phenomena that would otherwise be out of reach. As an example, we use the public goods game on a square lattice where two strategies compete for common resources…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
††thanks: Electronic address: [email protected]
High-performance parallel computing in the classroom using the public goods game as an example
Matjaž Perc
Faculty of Natural Sciences and Mathematics, University of Maribor, Koroška cesta 160, SI-2000 Maribor, Slovenia
CAMTP – Center for Applied Mathematics and Theoretical Physics, University of Maribor, Mladinska 3, SI-2000 Maribor, Slovenia
Abstract
The use of computers in statistical physics is common because the sheer number of equations that describe the behavior of an entire system particle by particle often makes it impossible to solve them exactly. Monte Carlo methods form a particularly important class of numerical methods for solving problems in statistical physics. Although these methods are simple in principle, their proper use requires a good command of statistical mechanics, as well as considerable computational resources. The aim of this paper is to demonstrate how the usage of widely accessible graphics cards on personal computers can elevate the computing power in Monte Carlo simulations by orders of magnitude, thus allowing live classroom demonstration of phenomena that would otherwise be out of reach. As an example, we use the public goods game on a square lattice where two strategies compete for common resources in a social dilemma situation. We show that the second-order phase transition to an absorbing phase in the system belongs to the directed percolation universality class, and we compare the time needed to arrive at this result by means of the main processor and by means of a suitable graphics card. Parallel computing on graphics processing units has been developed actively during the last decade, to the point where today the learning curve for entry is anything but steep for those familiar with programming. The subject is thus ripe for inclusion in graduate and advanced undergraduate curricula, and we hope that this paper will facilitate this process in the realm of physics education. To that end, we provide a documented source code for an easy reproduction of presented results and for further development of Monte Carlo simulations of similar systems.
Monte Carlo method, parallel computing, public goods game, graphics processing unit
I Introduction
The use of computers to solve problems in statistical physics has a long and fruitful history, dating as far back as the Manhattan Project, where analog computers were used so frequently they often broke down. Digital computers, such as the ENIAC (Electronic Numerical Integrator and Computer), were intertwined with nuclear science from its onset onwards. In fact, one of the first real uses of ENIAC was by Edward Teller, who used the machine in his early work on nuclear fusion reactions Atomic Heritage Foundation (2016). Today, computers are used in practically all areas of physics, and it is indeed difficult to imagine scientific progress without them. As rightfully pointed out by Newman and Barkema Newman and Barkema (1999), Monte Carlo methods form the largest and the most important class of numerical methods for solving statistical physics problems. Not surprisingly, in addition to fascinating original research dating back more than three decades Binder and Müller-Krumbhaar (1974); Binder and Landau (1980); Binder (1981), the subject is covered in reviews Binder (1997); Hinrichsen (2000); Ódor (2004) and many books Binder and Hermann (1988); Newman and Barkema (1999); Marro and Dickman (1999); Landau and Binder (2000) in varying depth.
While the famous Ising model Ising (1925); Onsager (1944) is the workhorse behind many introductory as well as less introductory texts on the subject Ibarra-García-Padilla et al. (2016); Newman and Barkema (1999), and is likely the most thoroughly researched model in the whole of statistical physics Krauth (2006), we here use another example for two reasons. In the first place, how to adapt the classical Monte Carlo algorithm for the Ising model to become fit for parallel computing on a graphics processing unit has already been demonstrated by Preis et al. Preis et al. (2009). In fact, parallel computing has already been applied to a number of other statistical physics problems, such as to the dimensional surface growth model Schulz et al. (2011), to the Kardar-Parisi-Zhang model Kelling et al. (2012); Kelling and Ódor (2011); Ódor et al. (2014), to simulate stochastic differential equations Januszewski and Kostur (2010), Brownian motors Spiechowicz et al. (2015), stochastic processes Barros et al. (2011), and Arnold diffusion Seibert et al. (2011), as well as to study ferrofluids Polyakov et al. (2013), photon migration Alerstam et al. (2008), anomalous coarsening in the disordered exclusion process Juhász and Ódor (2012), and probability-based simulations in general Tomov et al. (2005). Secondly, it might be welcome to add a little color to the curriculum by expanding on the classical subjects and thus to increase the popularity of physics with the students, although the public goods game has been a fixture in statistical physics research for at least a decade Szabó and Fáth (2007); Perc et al. (2013).
The public goods game that we use here as an example is often studied in the realm of evolutionary game theory as the paradigmatic case of a social dilemma Sigmund (1993); Weibull (1995); Hofbauer and Sigmund (1998); Nowak (2006a); Sigmund (2010). The blueprint of the game is simple. The public goods game is played in groups, wherein each member of the group can choose between two strategies. If a member chooses to cooperate, it contributes a fixed amount to the common pool (). Conversely, if a player chooses to defect, it contributes nothing to the common pool (). The contributions from all the cooperators within a group are summed together and multiplied by a so-called synergy factor . The latter takes into account the added value of a group effort. Lastly, the sum total of public goods after the multiplication is divided equally among all group members, and this regardless of their strategies. It is thus straightforward to see that an individual member is best of if it chooses to defect, because it can enjoy the same benefits as cooperators whilst contributing nothing. However, if everybody chooses to defect the factor multiplies zero and the public goods are lost to all. The latter scenario is often referred to as the tragedy of the commons Hardin (1968). Hence the classical social dilemma is given, where what is best for an individual is at odds with that is best for the group or the society as a whole. The question is under which conditions cooperation can nevertheless evolve.
Methods of statistical physics have recently been applied to subjects that, in the traditional sense, could be considered as out of scope. Statistical physics of social dynamics Castellano et al. (2009), of evolutionary games in structured populations Szabó and Fáth (2007); Perc and Szolnoki (2010); Wang et al. (2015), of crime D’Orsogna and Perc (2015), and of epidemic processes and vaccination Pastor-Satorras et al. (2015); Wang et al. (2016), are all recent examples of this exciting development. And the evolution of cooperation in the realm of the public goods game is no exception Perc et al. (2013); Perc (2016). Especially the consideration of the public goods game in a structured population is within the domain of statistical physics Szolnoki et al. (2009); Szolnoki and Perc (2013); Javarone et al. (2016); Javarone and Battiston (2016). In the simplest case, a structured population is described by the square lattice, whereon cooperators can form compact clusters and can thus avoid, at least those in the interior of such clusters, being exploited by defectors Nowak and May (1992).
When studying the public goods game on a square lattice, Monte Carlo simulations are used for random sequential strategy updating. This ensures that the treatment is aligned with fundamental principles of statistical mechanics, and it enables a comparison of obtained results with generalized mean-field approximations Dickman (2001); Szolnoki (2002); Dickman (2002); Szolnoki et al. (2005) as well as a proper determination of phase transitions between different stable strategy configurations. However, such monte Carlo simulations require significant computational resources, especially if the size of the lattice is large, and if the system is close to a phase transition where fluctuations are strong. It is thus of interest to utilize parallel computing that is nowadays possible on many graphic cards installed in personal computers. Here we use the NVIDIA graphic card GeForce GTX 1080 and the CUDA programming environment NVIDIA Corporation (2017). The latter is designed to work with all main programming languages, including C that we use, as well as C++, C#, Fortran, Python and Java. Graphic cards that support the CUDA programming environment are today available from hundred euros upward from all main graphic card manufacturers.
In what follows, we briefly describe the CUDA programming environment in Section II, and we describe the public goods game and the parallelization of the Monte Carlo method in Sections III and IV, respectively. In Section V we present the results and compare the performance of parallel computing with traditional CPU-based computing. Lastly, we sum up and discuss the development of similar Monte Carlo simulations for related systems in Section VI.
II CUDA programming environment
The CUDA programming environment is freely available and upon installation integrates itself seamlessly with existing compilers available in the operating system. It includes the CUDA compiler, math libraries, as well as tools for debugging and performance optimization. The documentation available online at NVIDIA Corporation (2017) is comprehensive and clear, so the reader is advised to look for details there.
The CUDA extension to the C programming language introduces new keywords and expressions that enable the user to distinguish between variables and functions that are stored in RAM and execute on the CPU (the host), and variables and functions (typically called kernels) that are stored and executed on the graphics processing unit (the device). Likewise, special keywords and expressions are available to transfer data between the host and the device.
The graphics processing unit that supports CUDA is composed of streaming multiprocessor, each of which is further composed of several scalar processors. Each streaming multiprocessor has different memory types available to it, namely a set of 32-bit registers, a shared memory block, as well as global memory. The 32-bit registers are fastest and smallest, while the global memory is largest and slowest. A scalar processor can access only its own register, each scalar processor in a particular streaming multiprocessor can access its own shared memory block, while the global memory is accessible to all scalar processors in all streaming multiprocessor (hence the whole device) as well as to the host. An important part of efficient parallelization of the problem involves minimizing the need to access global memory, and to make full use of the registers and the shared memory block. Although respecting the memory hierarchy is key to a fully optimized solution, significant improvements in computing power are attainable even if the memory is handled less than optimally.
From the programming point of view, each problem needs to be split into parts that can then be processed in parallel by threads. Threads form blocks of threads, and blocks further form grids of blocks. The total number of threads is thus the number of threads in each block times the numbers of blocks in the grid. Each thread is executed by a scalar processor, and each block of threads is assigned to a particular streaming multiprocessor. All threads within a block can thus access the previously mentioned shared memory block. The code that is executed within each thread is called a kernel, and each thread has a unique ID that is determined based on the block and grid structure. These are the basic concepts of parallel computing that quickly become familiar to those that decide to implement it.
III The spatial public goods game
We use the spatial public goods game Szabó and Hauert (2002); Szolnoki et al. (2009); Perc et al. (2013) to demonstrate the parallelization of the Monte Carlo algorithm within the CUDA programming environment. The game is staged on a square lattice with periodic boundary conditions where players are arranged into overlapping groups of size , such that everyone is connected to its four nearest neighbors. As schematically illustrated in Fig. 1, each player is therefore member in different groups. Players that cooperate () contribute into the common pool, while players that defect () contribute nothing. The sum of all contributions within a group is multiplied by and then divided equally amongst all group members regardless of their strategies. In a group containing players, of which cooperate, the resulting payoffs for cooperators and defectors are thus
[TABLE]
respectively. Evidently, the payoff of a defector is always larger than the payoff of a cooperator, if only . With a single parameter, the public goods game hence captures the essence of a social dilemma in that defection yields highest short-term individual payoffs, while cooperation is optimal for the group, and in fact for the society as a whole. The overall payoff of a player from all the groups is simply the sum .
Monte Carlo simulations of the described public goods game are carried out as follows. Initially each player on site of the square lattice is designated either as a cooperator () or defector () with equal probability. The following elementary steps are subsequently repeated in a random sequential manner. A randomly selected player plays the public goods game as a member of all the groups, thereby obtaining the payoff . Next, one of the four nearest neighbors of player is chosen uniformly at random, and this player acquires its payoff in the same way. Finally, player copies the strategy of its randomly chosen nearest neighbor with the probability determined by the Fermi function
[TABLE]
where quantifies the uncertainty by strategy adoptions Szabó and Tőke (1998); Szolnoki et al. (2009). In the limit, player copies the strategy of player if and only if . Conversely, in the limit, payoffs seize to matter and strategies change as per flip of a coin. Between these two extremes players with a higher payoff will be readily imitated, although the strategy of under-performing players may also be occasionally adopted, for example due to errors in the decision making, imperfect information, and external influences that may adversely affect the evaluation of an opponent. Repeating all the described elementary steps times constitutes one full Monte Carlo step (MCS), thus giving a chance to every player to change its strategy once on average.
As the main observable, we determine the average fraction of cooperators as
[TABLE]
where if and otherwise, and indicates average over time in the stationary state. A sufficiently long relaxation time needs to be discarded prior to this. In general, the stationary state is reached once becomes independent of the time interval over which it is determined.
IV Parallelization of the Monte Carlo simulation method
While the implementation of the Monte Carlo simulation method described in Section III is straightforward, its parallelization requires care in that we have to make certain that threads that will process different parts of the lattice do not simultaneously change strategies of the same players, and that the strategies of players do not change whilst the determination of the payoffs takes place. The remedy lies in partitioning the lattice as shown in Fig. 2, which was originally proposed in Shim and Amar (2005) for parallel kinetic Monte Carlo simulations of thin film growth, and subsequently used also in Kelling et al. (2012) for simulating the Kardar-Parisi-Zhang model and the kinetic Monte Carlo model. The approach is known as the double tiling decomposition scheme, effectively partitioning the lattice into equally sized and independent domains of type , , and (see Fig. 2), which during a full Monte Carlo step should be updated in turn. Accordingly, gives us the number of threads needed in total within the graphics processing unit, with each thread being assigned to one domain. Note that a full Monte Carlo step is split into four parts, the first part updating players within all domains , the second part updating players within all domains , the third part updating players within all domains , and finally the fourth part updating players within all domains . But since all the domains of a given type are independent in that they do not share a player or even a border, they can be updated in parallel by threads.
However, by looking at the schematic display of groups in Fig. 1, it becomes clear that a randomly selected player at the border of a particular domain will need some information of player strategies also from adjacent domains. Even more so if the potential source of the new strategy, player , will be selected on the other side of the border. We remind that player can be any of the four nearest neighbors, and a player at the border of the domain will have one player as the nearest neighbor in the other domain (players in the corners will in fact have two nearest neighbor in two other domain). The memory that needs to be passed to a particular thread must thus contain not only the strategies within a domain, but also the strategies of players three lines outward in each direction. This is schematically illustrated in Fig. 3. The reader can convince herself that this is in fact the case by following the composition of groups depicted in Fig. 1 along the border of a particular domain whilst assuming that player is chosen from the other side of the border. Thankfully, this requirement does not interfere with independent parallel random sequential updating in the other domains of the same color in Fig. 2 if only the size of the domains is or larger.
This is essentially all there is to the parallelization. Technically, we thus split the lattice into adjacent domains of four different types, update all domains of the same type in parallel, and do so consecutively over all the type to complete one full Monte Carlo step. The only technical hiccup is to make sure information on player strategies is available also three player lines outward in each direction. In our particular case, we have chosen the domain size to be (exactly as displayed in Fig. 3), which together with the auxiliary memory still allows all strategy to be stored in the fast shared memory block if the number of threads within a block is . The latter choice, on the other hand, is conditioned on the fact that one streaming multiprocessor simultaneously executes a so-called warp of threads in the large majority of today’s graphic cards. Given these choices, the number of blocks within the grid can be determined according to . We note that, depending on , might not be an integer. In fact, it will be only when is exactly divisible by . In all other cases should simply be the smallest integral value not less than the originally calculated . In this case one ends up with more threads then needed, but it is easy to discard those with an statement. For example, for and the domain size , it is clear that the actual number of threads needed is . But with the “number of threads within a block” constrain, the number of blocks within the grid will be , in turn yielding threads available.
Lastly, we note that the selection of the domain size and the partitioning of the lattice linearly in sequences of ( or ) obviously requires that the linear size of the lattice be divisible by lest the decomposition will not be perfect. We again refer to Fig. 2 for details.
The source code that implements the above described parallelization is available as supplementary material to this paper as well as at github.com/matjazperc/pgg.
V Results
As explained in Section III when introducing the payoffs of the public goods game, if the payoff of a defector is always larger than the payoff of a cooperator. Accordingly, is the threshold that marks the transition between defection and cooperation in well-mixed populations, where groups are formed by selecting players uniformly at random. In structured populations, however, due to the so-called network reciprocity Nowak (2006b), cooperators are able to survive at multiplication factors that are well below the limit that applies to well-mixed populations. The manifestation of network reciprocity relies on pattern formation, such that cooperators form compact clusters and can thus avoid exploitation by defectors Nowak and May (1992). In short, cooperators do better if they are surrounded by other cooperators.
Previous research has shown that for the spatial public goods game on the square lattice with the threshold for cooperators to survive is Szolnoki et al. (2009). Also importantly, it was shown that the public goods game on the square lattice exhibits continuous phase transitions that belong to the directed percolation universality class Helbing et al. (2010), such that
[TABLE]
where is the critical parameter value at which the absorbing phase is reached and is the critical exponent Hinrichsen (2000). We remind the reader that the directed percolation universality class conjecture requires that Janssen (1981); Grassberger (1982) (i) the model displays a continuous phase transition from a fluctuating active phase into a unique absorbing phase, (ii) the transition is characterized by a positive one-component order parameter (in our case ), (iii) the dynamic rules involve only short-range processes (yes due to the square lattice), and (iv) the system has no special attributes such as additional symmetries or quenched randomness.
We can use this conjecture that fully applies to the presently studied public goods game as validation for the parallelization of the Monte Carlo method presented in Section IV. As shown in Fig. 4, the phase transition leading from the mixed phase at to the absorbing phase as decreases is indeed continuous (see inset), and it belongs to the directed percolation universality class with and (main panel). This confirms the previously published results in Szolnoki et al. (2009); Helbing et al. (2010) obtained with traditional CPU-based computing, and it also confirms the applicability of the directed percolation universality class conjecture to the public goods game on the square lattice. For the results presented in Fig. 4, we have simulated the public goods game on lattices of size in the immediate proximity of the phase transition point up to full MCS, subsequently using smaller lattice sizes and shorter simulations times when being further away from . With CPU-based computing, such simulations typically require weeks to complete on clusters with many cores. As the inset of Fig. 4 shows, without using larger lattices, for example staying at the size where CPU-based computing is not significantly slower than parallel computing, the nature of the phase transition becomes distorted. We refer to page in Krauth (2006) for a classical statistical mechanics example of qualitatively the same phenomenon.
It remains of interest to quantitatively asses the advantage in time obtained by switching from CPU-based computing to parallel computing on the graphics processing unit. Results presented in Fig. 5 show that for CPU-based computing the simulation times increase roughly with the square of the increase of the linear size of the lattice . For example, if using the same number of full MCS, it takes approximately four times as much time to finish the simulation on a square lattice as it does to finish the same simulation on the square lattice. As the chosen linear size of the lattice grows, this factor of four grows ever so slightly towards five and above due to increasing memory demands. Note that at lattice size it takes in excess of five days for the CPU to complete full MCS. At lattice size it takes nearly a month. Conversely, with parallel computing the simulation times remain nearly constant as increases to the point where the number of threads does not exceed streaming multiprocessors present in the graphics card that we have used (NVIDIA GeForce GTX 1080). With the domain size the double tiling decomposition scheme, this yields a , beyond which the simulation times start increasing with a factor between three and five for each doubling of the linear lattice size . Accordingly, the larger the lattice size the greater the benefits from parallel computing.
This is confirmed in the inset of Fig. 5, where the ratio between parallel computing time and CPU-based computing time in dependence on is shown. We note that the ratio can also be smaller than one, in particular if the lattice is so small that the benefits of parallel computing can not be taken advantage of with the constrain of a domain size (we remind the reader that is the smallest permissible domain size to avoid prohibited memory overlaps between the domains in Fig. 2, as explained with Fig. 3). In that case the number of threads does not offset the slower clock speed of streaming multiprocessors in comparison to the CPU clock speed (Intel Xeon processor with 2.80 GHz clock speed). The dashed-dotted gray line in the inset at marks the lattice size at which CPU-based computing is equally fast as parallel computing. When , however, the acceleration of the simulations is impressive. At , the ratio is , thus cutting the simulation time for full MCS from a month to less than an hour and a half at this lattice size. Of course still somewhat too long for classroom demonstrations, but such large lattice sizes are rarely needed (see Szolnoki and Perc (2013) for an example). Taken together, results presented in Fig. 5 confirm that the effort in successfully parallelizing the Monte Carlo simulation method is rewarded with simulation times that can be orders of magnitude shorter than attainable with conventional CPU-based computing.
VI Discussion
We have advocated for the use of graphics processing units to bring high-performance parallel computing into the physics classroom. We have used the public goods game on the square lattice as an example to demonstrate the parallelization of the Monte Carlo simulation method by means of the double tiling decomposition scheme, and we have explained in detail the subtleties of memory sharing and conflict avoidances when simultaneously updating different parts of the lattice by several threads running in parallel at the same time. We have shown that the parallelization preserves all the most important properties of the public goods game related to statistical physics, in particular the continuous character of the phase transition and the directed percolation universality class to which the phase transition belongs. While the calculation of these results would require weeks of many-core clusters if done with traditional CPU-based computing, a single capable graphics card decreases this time by a factor of 500, thus making these phenomena viable for presentation in the classroom.
We note that the described parallelization scheme can be easily adapted so that Monte Carlo simulations of other evolutionary games on the square lattice can be performed, such as for example the well-known prisoner’s dilemma game Hauert and Szabó (2005); Javarone (2016) or other social dilemmas Szabó and Fáth (2007); Perc and Szolnoki (2010). An important difference with regards to the public goods game is that the prisoner’s dilemma game is played in a pairwise manner, not in groups. As a consequence, the memory needed to update strategies within a given domain is one line outward less in each direction (see Fig. 3), which can serve as good practice when adapting the source code. Of course, and as already demonstrated in original research published in the past Preis et al. (2009); Januszewski and Kostur (2010); Schulz et al. (2011); Kelling and Ódor (2011); Seibert et al. (2011); Barros et al. (2011); Kelling et al. (2012); Juhász and Ódor (2012); Polyakov et al. (2013); Ódor et al. (2014); Spiechowicz et al. (2015), the same approach can be used to simulate a broad variety of classical statistical physics systems.
With the Moore’s law, stating that the number of transistors in a dense integrated circuit doubles approximately every two years, arriving at it limits due to fundamental technical constraints in CPU design, today’s hardware has to be designed in a multi-core manner to keep up. This in turn means that if we want to benefit from faster simulation times, the software has to be written in a multi-threaded manner to take full advantage of the hardware. The graphic cards industry has invested admirable effort for this to materialize, one product of which is the friendly and thoroughly documented CUDA programming environment. The time is thus ripe for these programming techniques to be introduced into graduate and advanced undergraduate curricula, giving students the chance to learn the benefits of parallel computing from the onset of their physics education. We hope that this paper will be useful to that effect.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Atomic Heritage Foundation (2016) Atomic Heritage Foundation, Computing and the Manhattan Project ( atomicheritage.org/history , 2016).
- 2Newman and Barkema (1999) M. E. J. Newman and G. T. Barkema, Monte Carlo Methods in Statistical Physics (Oxford University Press, Oxford, 1999).
- 3Binder and Müller-Krumbhaar (1974) K. Binder and H. Müller-Krumbhaar, Phys. Rev. B 9 , 2328 (1974).
- 4Binder and Landau (1980) K. Binder and D. P. Landau, Phys. Rev. B 21 , 1941 (1980).
- 5Binder (1981) K. Binder, Z. Phys. B Condensed Matter 43 , 119 (1981).
- 6Binder (1997) K. Binder, Rep. Prog. Phys. 60 , 487 (1997).
- 7Hinrichsen (2000) H. Hinrichsen, Adv. Phys. 49 , 815 (2000).
- 8Ódor (2004) G. Ódor, Rev. Mod. Phys. 76 , 663 (2004).
