
TL;DR
This paper models adaptive learning in large populations using a PDE derived from Harley's rule, revealing that faster memory decay benefits evolution in certain 2x2 games.
Contribution
It introduces a PDE framework for stochastic adaptive learning in large populations, connecting behavioral rules to population dynamics.
Findings
Faster memory decay provides an evolutionary advantage in some games.
The PDE model captures the conservation of stimuli in behavior selection.
Analysis of 2x2 games shows conditions favoring different learning speeds.
Abstract
We consider the adaptive learning rule of Harley (1981) for behavior selection in symmetric conflict games in large populations. The rule uses organisms' past, accumulated rewards as the predictor for future behavior, and can be traced in many life forms from bacteria to humans. We derive a partial differential equation (PDE) that describes the stochastic learning in a population of agents. The equation has the simple structure of a 'conservation of mass'-type equation in the space of stimuli to engage in a particular type of behavior. We analyze the solutions of the PDE model for typical 2x2 games. It is found that in games with small residual stimuli, adaptive learning rules with faster memory decay have an evolutionary advantage.
Table 1: Payoff matrix of the game.
| | A | B |
| A | (ah, ah) | (dh, ch) |
| B | (ch, dh) | (bh, bh) |
Adaptive Learning in Large Populations
Misha Perepelitsa
Department of Mathematics
University of Houston
4800 Calhoun Rd.
Houston, TX.
Abstract
We consider the adaptive learning rule of Harley (1981) for behavior selection in symmetric conflict games in large populations. This rule uses organisms’ past, accumulated rewards as the predictor for future behavior, and can be traced in many life forms from bacteria to humans. We derive a partial differential equation (PDE) that describes the stochastic learning in a heterogeneous population of agents. The equation has the structure of a conservation-of-mass type equation in the space of stimuli to engage in a particular behavior. We analyze the solutions of the PDE model for symmetric 2x2 games. It is found that in games with small residual stimuli, adaptive learning rules with a larger memory factor converge faster to the optimal outcome.
keywords:
Adaptive learning, relative payoff sum, symmetric games
Journal: Journal of Mathematical Biology
1 Introduction
The seminal paper of Maynard Smith & Price (1973) introduced the concepts of game theory into the study of animal behavior to explain the evolution of behavioral traits. To this end, the notion of an evolutionarily stable strategy (ESS) was developed to describe stable outcomes of natural selection, by defining it as being uninvadable by players of pure or mixed strategies. A dynamical process leading to an ESS can be formalized by the following model. Consider a situation where each individual in a population consistently uses one of the available behavioral traits to interact with other members and no mutations take place. Assume the choice of the behavior (action) is inherited, and the population consists of groups that use a particular behavior. Over a long period of time involving a large number of interactions, the average fitness per game for individuals using denoted by is compared to the population average fitness , and the individuals in -group are reproduced at rate proportional to If the changes in the frequencies of are approximately continuous Taylor & Jonker (1978), Zeeman (1981) derived the replicator dynamics equations for the frequencies of individuals acting according to
[TABLE]
and showed that ESS’s are the asymptotically stable fixed points of this system of equations.
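The replicator dynamics described above can be sketched numerically for a symmetric 2x2 game. The following is a minimal illustration, not the paper's equation (1) verbatim: the variable `x` (frequency of action A), the payoff layout, the explicit Euler scheme, and the example payoffs are all assumptions made for this sketch.

```python
def replicator_step(x, payoff, dt):
    """One explicit Euler step of dx/dt = x * (f_A - f_mean).

    x      : frequency of individuals using action A (assumed notation)
    payoff : ((a, d), (c, b)); rows are my action (A, B),
             columns are the opponent's action (A, B)
    """
    (a, d), (c, b) = payoff
    f_a = a * x + d * (1.0 - x)          # average fitness of A-players
    f_b = c * x + b * (1.0 - x)          # average fitness of B-players
    f_mean = x * f_a + (1.0 - x) * f_b   # population average fitness
    return x + dt * x * (f_a - f_mean)


def integrate_replicator(x0, payoff, t_end=10.0, dt=0.01):
    """Integrate the replicator equation from x0 up to time t_end."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        x = replicator_step(x, payoff, dt)
    return x


# Hypothetical game in which A strictly dominates B (a > c and d > b).
DOMINANT_A = ((3.0, 2.0), (1.0, 0.0))
x_final = integrate_replicator(0.1, DOMINANT_A)
```

In this example "always A" is the only ESS, and the frequency of A approaches 1 from an initial value of 0.1, in line with the statement that ESS's are asymptotically stable fixed points.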
Considered from the point of view of game theory, an evolutionarily stable strategy is a refinement of a Nash equilibrium, which describes an optimal choice of actions in games. In this way, natural selection is a mechanism for implementing rational decision-making in the evolution of species. There is another way by which organisms, even without complex cognition, can discover optimal actions. It can be achieved through their ability to regulate behaviors depending on experience, in particular, through the tendency to repeat positive and avoid negative experiences. This is known as the law of effect, first formulated by Thorndike (1989), and generally accepted as one of the main paradigms of animal behavior; see, for example, Ferster & Skinner (1957), Herrnstein (1961), Herrnstein (1970), Catania (1963), Chung & Herrnstein (1967), Domjan & Burkhard (1986).
The law of effect is the basis for reinforcement learning models. They were introduced by Bush & Mosteller (1955), and since then have been applied to problems in such diverse areas as biology, economics, and engineering. Some representative examples of the extensive literature on this subject can be found in Harley (1981), Cross (1983), Roth & Erev (1995), Erev & Roth (1998), Sutton & Barto (1998), Sandholm (2010), Nax & Perc (2015).
Interestingly enough, Börgers & Sarin (1997) demonstrated that models of learning also lead to replicator dynamics equations similar to (1). For that, they considered repeated plays of a 2x2 game between two agents who adjust their probabilities for actions according to the reinforcement learning model of Cross (1983), and derived the replicator dynamics equation in the limit of small payoffs. More general situations, in terms of the type of game, the kind of reinforcement model, and their relation to replicator-like equations, are discussed in the monograph by Fudenberg & Levine (1998) and in Rustichini (1999), Fudenberg & Takahashi (2011), and Mertikopoulos & Sandholm (2016).
The law of effect can be expressed in many different ways, depending on which decision-making facilities are reinforced and on the specific rules of reinforcement. In the model of Cross (1983), it is the probability to play a particular action that undergoes reinforcement. In an alternative model motivated by biochemical processes in neural circuits, Harley (1981) proposed the relative payoff sum (RPS) algorithm as a decision-making mechanism. The RPS learning rule assumes the ability of an organism to maintain a record of cumulative rewards from previous experiences, which at epoch is given by vector is the predictor for future behavior and can be interpreted as a vector of “motivations” or “stimuli” to engage in a corresponding action. From the current stimuli an agent computes the probability to play
[TABLE]
If action is chosen and it brings payoff which is assumed to be non-negative, then stimuli are updated according to the rule
[TABLE]
and for
[TABLE]
Positive parameter expresses memory effect: payoff from previous plays will appear with the weight in the expression for Parameter is some default level (residual) of stimuli For example, the residuals might represent the genetic preference for the action. A similar learning model was introduced by Roth & Erev (1995).
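To make the RPS rule concrete, here is a minimal sketch of a single update. It assumes the commonly quoted form of Harley's rule, in which every stimulus is first pulled toward its residual, `memory * S + (1 - memory) * r`, and the payoff is then added to the stimulus of the chosen action. The function and parameter names (`choice_probabilities`, `rps_update`, `memory`, `residuals`) are hypothetical, and the exact form may differ from the equations of this paper.

```python
def choice_probabilities(stimuli):
    """Probability of each action: its stimulus relative to the sum of all stimuli."""
    total = sum(stimuli)
    return [s / total for s in stimuli]


def rps_update(stimuli, chosen, payoff, memory, residuals):
    """Return the stimuli after playing action `chosen` for a non-negative `payoff`.

    memory    : assumed memory factor in (0, 1]; 1 means perfect memory
    residuals : assumed default (residual) stimulus level of each action
    """
    # Every stimulus decays toward its residual level.
    updated = [memory * s + (1.0 - memory) * r
               for s, r in zip(stimuli, residuals)]
    # Only the action that was actually played is reinforced.
    updated[chosen] += payoff
    return updated


probs = choice_probabilities([1.0, 3.0])
new_stimuli = rps_update([1.0, 3.0], chosen=0, payoff=2.0,
                         memory=0.5, residuals=[1.0, 1.0])
```

With stimuli (1, 3) the first action is played with probability 0.25. After it is played for a payoff of 2 with memory factor 0.5 and unit residuals, the stimuli become (3, 2), so the reinforced action is now the more likely one.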
The goal of this paper is to address the question of the behavior of RPS learning agents in large populations, where agents are randomly matched in pairwise encounters, i.e. learning in heterogeneous populations. Unlike in the evolutionary dynamics framework, RPS learning agents cannot be identified with a particular strategy that they use all the time. The RPS rule will in general prescribe a new strategy every time an agent plays.
In these situations, the natural quantity to describe the state of agents is the distribution of agents according to their current stimuli. In large populations, the probability density function of this distribution can be approximated by a continuous function and its changes can be described by a non-linear Fokker-Planck equation. The derivation of this equation, which generalizes the replicator dynamics equation for heterogeneous populations, is contained in the Appendix of this paper. The assumptions needed for the derivation of the equation are: a large population size, a large number of plays of the game, and an incremental (infinitesimal) structure of payoffs. This approach is a well-known method for modeling multi-agent systems in problems of physics, biology, economics and sociology; see, for example, Risken (1992) and Pareschi & Toscani (2014). A similar method has been used by Traulsen, Claussen & Hauert (2005, 2006) in the analysis of evolutionary selection by the Moran process.
After deriving the equations we apply them to determine the behavior of RPS learners in symmetric 2x2 games. It should be kept in mind, however, that the time asymptotic behavior based on the Fokker-Planck equation and the time asymptotic of the original discrete-time stochastic process are not, in general, the same. Moreover, the PDE model derived in this paper is a leading approximation of a continuous-time stochastic process. Thus, any statement claimed in this paper about the convergence of the system to a particular state should be understood as a statement that the system gets close to this state within the limits of the validity of the PDE approximation.
In game theory, the method of stochastic approximation by Benaïm & Hirsch (1999) is typically used to determine the long-time behavior of interacting agents. The method relies on the stability analysis of an averaged (deterministic) system of ODEs. As the dimension of the system is of the order of (the number of agents) × (the number of strategies per agent), even for small populations it is an extremely challenging task. The Fokker-Planck equation, on the other hand, is applicable for large populations, the trade-off being the loss of information about a particular player.
The analysis of the PDE model derived in this paper shows that for games with a single Nash equilibrium, the strategies of all agents converge to the dominant strategy when the RPS rule has no memory factor, or has a memory factor and zero residuals. For learning models with a memory factor the convergence is faster than for the models with perfect memory. Additionally, the learning time in the former case varies inversely with the size of the memory factor. For games with a mixed Nash equilibrium, the learning process converges to a state in which the population mean probability equals the equilibrium value. As the population mean probability approaches its limit, the “strength” of the learning decreases, and the individual probabilities do not, in general, converge to the equilibrium value. In this case, the population remains heterogeneous. Finally, if the memory factor is present and the residuals are nonzero, agents’ strategies converge to some mixed strategy, for a generic 2x2 symmetric game.
2 The Model
We consider a series of plays of a game between randomly selected individuals in a large population. The payoff matrix of the game is given in Table 1. The behaviors are labeled A and B.
We will analyze the RPS learning in the games that a) have a single pure Nash equilibrium with (or ); b) two Nash equilibria with ; and c) mixed Nash equilibrium when In the latter case, there are also two asymmetric Nash equilibria and
Now, let there be a group of individuals where each individual is characterized by a vector representing the accumulated stimuli to play A and B, respectively, at epoch Suppose that two agents and are selected at random to play the game. Agents play with the probabilities to cooperate and for agent and agent respectively. Denote the outcomes (A,A), (A,B), (B,A), (B,B), where the first entry is the action chosen by agent As the result of the interaction agent increments his/her states according to the rule
[TABLE]
and symmetrically for agent are the residual stimuli and the parameter of the fading memory. The former is related to the memory factor from the introduction by formula Then, time moves to the next epoch and the process is reiterated. At time agents have generally different, initial levels of stimuli to cooperate and defect.
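The pairwise learning process described in this section can be simulated directly. The sketch below is an illustration under stated assumptions rather than the paper's exact rule (4): each selected agent's stimuli decay toward the residuals at a hypothetical rate `mu`, and the stimulus of the action played is incremented by `h` times the payoff. The payoff values, the names `simulate`, `h`, `mu`, `residual`, and the population size are chosen only for the example.

```python
import random

# Hypothetical payoffs (my_action, other_action) -> my payoff; 0 = A, 1 = B.
# A strictly dominates B in this example.
PAYOFF = {(0, 0): 3.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 0.0}


def prob_a(stimuli):
    """Probability to play A given the pair of stimuli (S_A, S_B)."""
    return stimuli[0] / (stimuli[0] + stimuli[1])


def simulate(n_agents=100, n_games=20000, h=0.1, mu=0.0,
             residual=(1.0, 1.0), seed=0):
    """Randomly match pairs of RPS learners; return the mean probability to play A."""
    rng = random.Random(seed)
    stimuli = [[1.0, 1.0] for _ in range(n_agents)]
    for _ in range(n_games):
        i, j = rng.sample(range(n_agents), 2)
        # Both actions are drawn before any stimulus is updated.
        act_i = 0 if rng.random() < prob_a(stimuli[i]) else 1
        act_j = 0 if rng.random() < prob_a(stimuli[j]) else 1
        for agent, mine, other in ((i, act_i, act_j), (j, act_j, act_i)):
            s = stimuli[agent]
            for k in (0, 1):
                s[k] -= mu * (s[k] - residual[k])   # fading memory (assumed form)
            s[mine] += h * PAYOFF[(mine, other)]    # reinforce the action played
    return sum(prob_a(s) for s in stimuli) / n_agents


mean_p = simulate()
```

With perfect memory (`mu=0.0`) and A dominant, the population mean probability to play A should end up well above its initial value of 0.5. Setting `mu` to a small positive value gives a fading-memory variant of the same experiment.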
2.1 Stimuli-space
Our main interest is in the distribution of agents in the stimuli-space described by the probability density function (PDF) In this space the straight lines through the origin represent the sets of stimuli of constant probability to cooperate when an agent is playing a mixed strategy. The probability is related to the slope as so that the stimuli with preference for C are located closer to the -axis. For any subset in the stimuli space, represents the proportion of agents with their stimuli in the set at time
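The geometric statement about rays through the origin can be checked with a few lines of code. This sketch assumes that the probability to play A (the "cooperate" action) is the stimulus for A divided by the sum of both stimuli, and that the slope of a ray is measured as the stimulus for B over the stimulus for A; both conventions and the function names are assumptions made for illustration.

```python
def prob_a(s_a, s_b):
    """Probability to play A from the stimuli (s_a, s_b)."""
    return s_a / (s_a + s_b)


def prob_from_slope(slope):
    """Same probability expressed through the slope s_b / s_a of the ray."""
    return 1.0 / (1.0 + slope)


# Two points on the same ray (slope 3) give the same probability.
p_near = prob_a(1.0, 3.0)
p_far = prob_a(2.0, 6.0)
p_slope = prob_from_slope(3.0)
```

Scaling both stimuli by the same factor moves an agent along a ray without changing its strategy, which is why only the direction of the stimulus vector matters for behavior while its length controls how quickly the strategy can change.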
The following equation is found to be a leading order approximation for the process, when parameters are small and the number of players is large.
[TABLE]
and velocity
[TABLE]
where
[TABLE]
The analysis of the equation can be understood from the behavior of the system of ODEs:
[TABLE]
and the equation for that follows from (5):
[TABLE]
The derivation of equations (5)–(8) is contained in the Appendix.
Velocity represents the rates of change of the stimuli of agents whose current state is given by The rates are proportional to the group average payoffs for corresponding actions, and “penalized” by memory for large deviations from the default residual levels Equation (8) is convenient for analyzing the games with pure Nash equilibrium. For the game with a mixed equilibrium, described by the frequency to play A, more convenient is equation
[TABLE]
The nonlinear equation (5) is the first-order approximation of the stochastic learning process. The next-order approximation contains a diffusion term, with the diffusion coefficients of order The diffusion generally prevents the convergence of learning of the group to a single strategy (fixation). Thus, the convergence of the group learning to a single strategy, based on equation (5), only indicates that the distribution of agents’ strategies gets close to that particular strategy, within the limits of validity of equation (5).
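A first-order transport equation of this kind can be explored with a simple particle method: represent the density by equally weighted particles in the stimuli space and move each one along the velocity field. The velocity used below is an assumed form consistent with the verbal description above (rates proportional to the group average payoff for each action, minus a memory penalty for deviations from the residuals); it is not copied from equation (6), and the names `velocity`, `transport`, `mu`, and `residual` are hypothetical.

```python
def velocity(x, y, mean_p, payoff, mu=0.0, residual=(0.0, 0.0)):
    """Assumed drift of the stimuli (x, y) = (S_A, S_B) of one agent."""
    (a, d), (c, b) = payoff
    p = x / (x + y)                                   # this agent's probability to play A
    u_a = p * (a * mean_p + d * (1.0 - mean_p)) - mu * (x - residual[0])
    u_b = (1.0 - p) * (c * mean_p + b * (1.0 - mean_p)) - mu * (y - residual[1])
    return u_a, u_b


def transport(particles, payoff, t_end=50.0, dt=0.01, mu=0.0, residual=(0.0, 0.0)):
    """Move equally weighted particles along the flow with explicit Euler steps."""
    pts = [tuple(pt) for pt in particles]
    for _ in range(int(round(t_end / dt))):
        # The flow is nonlinear: it depends on the population mean probability.
        mean_p = sum(x / (x + y) for x, y in pts) / len(pts)
        moved = []
        for x, y in pts:
            u_a, u_b = velocity(x, y, mean_p, payoff, mu, residual)
            moved.append((x + dt * u_a, y + dt * u_b))
        pts = moved
    return pts


DOMINANT_A = ((3.0, 2.0), (1.0, 0.0))   # hypothetical game where A dominates
start = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0)]
end = transport(start, DOMINANT_A)
p_start = [x / (x + y) for x, y in start]
p_end = [x / (x + y) for x, y in end]
```

Because particles are neither created nor destroyed, the total "mass" is conserved by construction. In this example every particle's probability to play A increases along its trajectory, even for the agent that initially favors B.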
Now we consider the dynamics of learning in symmetric games. First we consider models with no memory factor,
2.2 Pure Nash equilibrium
Let and The characteristic property of this regime is the positive sign of in equation (8), for any distribution function This reflects the fact that action C is your best choice, no matter what your opponent does.
It is shown in the Appendix that increases to its maximum value 1 and the support of function is transported to infinity along the trajectories of ODE (7). In particular the stimuli of all agents, for large values of will be located below any straight line of positive slope through the origin. That is, asymptotically all agents learn to play the equilibrium strategy “always A”. The inclusion of the second-order effects does not change the asymptotic picture.
2.3 Mixed Nash equilibrium
Let and The equilibrium mixed strategy is and the group average probability to play A evolves according to equation (9).
It is shown in the Appendix that the dynamics of equation (5) implies that converges to the equilibrium density The population average probability to play A asymptotically coincides with the probability at the Nash equilibrium. Unlike the pure Nash equilibrium case, in general, agents keep playing with different strategies. This can be seen from the following example. If the initial data describes the population of agents playing different strategies and is such that
[TABLE]
then the trajectories of the flow generated by are straight lines and for any
[TABLE]
Clearly, for all there is a non-diminishing spread in the distribution function To put it differently, there is no learning when the group average probability to play A equals because players’ expected payoffs from A and B are equal.
2.4 Two Nash equilibria
For this type of game and Denote by As in the previous case, the group average probability to play A evolves according to equation (9). If the initial data velocity carries to values of stimulus values of much larger than As the average increases, the dynamics is consistent and leads to learning of the Nash equilibrium A. With learning converges to the other equilibrium. In the borderline case is unstable: stimuli increase along straight lines, and agents retain their initial probabilities to play A and B, but any perturbation will drive the system to the A or B equilibrium.
2.5 Memory factor RPS
Consider now models with It can be seen from the sign of the velocity components in equation (6) that a sufficiently large box is invariant under the flow of (7). We show in the Appendix that with residuals RPS learning will approach an asymptotically stable point in the stimuli space, for any positive payoff rates All agents will tend to play a mixed strategy When (A,A) is a Nash equilibrium, the ratio of stimuli and agents favor action A after learning more than at their default levels.
There is an interesting limiting case of zero residuals For such RPS models, when the game has (A,A) as the single Nash equilibrium, all agents learn this optimal strategy, as function converges to a delta mass supported at the point
In this case, the equation for mean is closely approximated by the replicator dynamics equation (11) given below. The factor on the right-hand side of that equation, in this case, is of order 1. The convergence to the optimal strategy is faster than in the case of learning with perfect memory (), for which the factor is of the order
Moreover, among the models with a memory factor, the learning period appears to be shorter for larger values of see Figure 1 and the explanations in the Appendix.
Thus it appears that RPS models with small residuals and large values of should be preferred by natural selection. In such models agents act predominantly on the basis of the last few payoffs. In this context it is worth mentioning that one of the postulates of the prospect theory of Kahneman & Tversky (1984) is the statement that people’s actions (in games with monetary payoffs) are directed by the increase in their total wealth, rather than the total accumulated wealth, which shows a tendency to use short memory and suggests that RPS learning might be at work.
2.6 Relation to the Replicator Dynamics Equation (RDE)
If one postulates that all agents have the same, or approximately the same, stimuli
[TABLE]
for some so that is represented by a delta function supported at then equation (5) leads to a variant of the replicator dynamics equation for the probability to cooperate
[TABLE]
Notice the positive factor on the right-hand side of the equation. For learning processes in which the stimuli increase, the learning rate slows down. The extent to which hypothesis (10) is consistent with the dynamics of (7) is limited only to the cases when the latter has a single asymptotically stable fixed point.
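The effect of the positive factor on the learning speed can be illustrated for a single representative agent. The sketch assumes zero residuals and the drift `dS_A/dt = p * f_A - mu * S_A`, `dS_B/dt = (1 - p) * f_B - mu * S_B`, where `p = S_A / (S_A + S_B)` and `mu` is a hypothetical memory-decay rate. Under this assumption the decay terms cancel in the equation for `p`, giving `dp/dt = p * (1 - p) * (f_A - f_B) / (S_A + S_B)`, so the learning rate is inversely proportional to the total stimulus. The payoffs, the target level, and the function name are choices made for the example.

```python
def time_to_learn(mu, payoff, target=0.9, dt=0.001, t_max=200.0, start=(1.0, 1.0)):
    """Time for a representative agent to reach probability `target` of playing A.

    mu : assumed memory-decay rate; mu = 0 corresponds to perfect memory.
    Residual stimuli are taken to be zero in this sketch.
    """
    (a, d), (c, b) = payoff
    x, y = start                          # stimuli for A and B
    t = 0.0
    while t < t_max:
        p = x / (x + y)
        if p >= target:
            return t
        f_a = a * p + d * (1.0 - p)       # expected payoff to A against the group
        f_b = c * p + b * (1.0 - p)       # expected payoff to B against the group
        x, y = (x + dt * (p * f_a - mu * x),
                y + dt * ((1.0 - p) * f_b - mu * y))
        t += dt
    return float("inf")


DOMINANT_A = ((3.0, 2.0), (1.0, 0.0))     # hypothetical game where A dominates
t_perfect = time_to_learn(0.0, DOMINANT_A)
t_slow_decay = time_to_learn(1.0, DOMINANT_A)
t_fast_decay = time_to_learn(2.0, DOMINANT_A)
```

With perfect memory the total stimulus grows without bound and learning slows down over time. With decay the total stimulus stays bounded, so the target is reached sooner, and sooner still for the larger decay rate.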
3 Appendix: a PDE model
3.1 Fokker-Planck equation
Consider a group of individuals acting according to the RPS learning rule described in Section 2. Let represent the vector of pairs of stimuli for all members at epoch Each component of this vector is 2-dimensional: By where we denote the PDF for the distribution of We will write where each The probability to play A will be denoted as
Suppose that members and are selected for the interaction. There will be only one game played during the period from to The matrix of payoffs is described in Table 1. The range of parameters will be restricted later on.
Conditioned on the event the agent probabilities for the next period are set according to the RPS rule (4), which in the notation of the stochastic process are
[TABLE]
and symmetrically for For all other agents, for The definition of makes it a discrete-time Markov process. We proceed by writing down the integral form of the Chapman-Kolmogorov equations and approximate its solution by a solution of the Fokker-Planck equation (forward Kolmogorov’s equation), for small values of and large
Change of from to can be described in the following way.
[TABLE]
This equation can be written in a slightly different way:
[TABLE]
The above equation can be used to obtain -dimensional ODE approximation of the stochastic process by evaluating This approach was implemented in the method of stochastic approximation developed by Benaïm & Hirsch (1999) and applied to the study of convergence of stochastic fictitious play processes. The method guarantees the convergence of the process under certain stability conditions for the dynamics of the associated ODE.
The large dimension of that dynamical system is an obstacle for further analysis. In contrast, we would like to obtain an equation for the distribution of a large number of agents in the 2-dimensional stimuli space. For this, denote the PDF of the distribution by
[TABLE]
where is a dimensional vector of all coordinates, excluding In statistical physics this function is also called the one-particle distribution. In the formulas to follow we need to use the two-particle distribution function
[TABLE]
where is the dimensional vector of all coordinates excluding and Function is symmetric in and is related to by the formulas
[TABLE]
The moments of function and are computed from the moments of
[TABLE]
and
[TABLE]
This follows from the definition of these functions.
Now we use (13) to obtain an integral equation of the change of function For that select sum over and take average. We get
[TABLE]
The right-hand side can be conveniently expressed in terms of the two-particle function
[TABLE]
where and and similarly for In processes with a large number of agents and random binary interactions, the two-particle distribution function can be factored into two independent distributions:
[TABLE]
With this relation, (15) becomes a family of non-linear integral relations for the next time step distribution Taking the Taylor expansions up to the first order for the increment of the test function we obtain integral equations:
[TABLE]
where
Combining various terms on the right-hand side of the equation we get
[TABLE]
with
[TABLE]
[TABLE]
By integrating the right-hand side by parts, and assuming that are small and is large, in such a way that we obtain the Fokker-Planck equation:
[TABLE]
where and the drift velocity is given by the formula
[TABLE]
where
[TABLE]
Equation (8) is obtained from (16) by multiplying it by and integrating by parts. We assume here that the support of is contained in the interior of the first quadrant, so that is zero on the boundary. That is to say that all agents play mixed strategies. This is a natural hypothesis, as nothing else can be learned if an agent chooses an action with certainty.
Consider now the learning from playing a symmetric game. Let the payoff to playing action against be equal to Denote the stimulus vector and the population average probability to play by
[TABLE]
The first-order approximation of the RPS learning process is given by the Fokker-Planck equation
[TABLE]
on the domain with In this equation, the velocity vector is given by its components
[TABLE]
Now we consider in some detail the learning in 2x2 games. Much of the analysis of equation (16) is derived from the behavior of trajectories of ODE (7). The solution of (16) is obtained by transporting the support of along trajectories of the dynamical system (7) and changing the values so that the “mass” (measured by the density function ) of any “fluid element” remains constant. In fact one can write down the formula for in terms of and prove that the solution of the non-linear problem exists and is unique. This can be done by standard methods of PDEs, but it is outside of the scope of the present paper. Here, we will be interested in the long-time, qualitative asymptotics for
Equation (16) is considered in the first quadrant For the boundary of the domain is invariant under the flow of (7). For the model with fading memory, the velocity at the boundary is directed into the flow domain. In either case, we will assume that the function is zero on the boundary. Then, this property will hold for all Additionally, in all of the analysis below, we assume that is a continuously differentiable function with compact support (zero outside some bounded set).
Consider the case of the pure Nash equilibrium () and no memory effects, The velocity
[TABLE]
where matrix
[TABLE]
For any the origin is an unstable node with two positive eigenvalues; the eigenvalue corresponding to –direction is the dominant one. From the phase portrait of the ODE it is clear that the flow transports the support of into the region where which corresponds to all agents asymptotically adopting choice C in the game.
In the case of the mixed Nash equilibrium () and no memory effect, the origin is an unstable node. When then the two positive eigenvalues coincide, and all trajectories are straight lines through the origin. In general, however, if for example they are not equal at time In such cases equation (9) can be used to show that Let Then, according to (9), for all and converges to provided that
[TABLE]
diverges. Notice also, that the derivative of ratio along a flow trajectory equals
[TABLE]
Thus, if at the support of is strictly inside the first quadrant, then there exists such that for all points in the support of for all later times. In particular, one can estimate
[TABLE]
Finally, since is uniformly bounded, i.e., for any for some then for any in the support of there is a constant such that From this and (19) it follows that
[TABLE]
for some constant and so, the integral in (18) is infinite. The case follows by similar arguments.
Consider now the model with the memory decay when and residuals For any value of and any set of positive parameters of the game the right-hand side of (7) has a steady state in the interior of the first quadrant, with and this point is an asymptotically stable node. The other steady state is the origin, which is an unstable node. The support of moves in the direction of the stable interior point, contracting in size. When it becomes sufficiently small, the dynamics can be effectively approximated by ODE: where the new velocity
[TABLE]
In the long run the fixed point will settle at the stable, interior state and will be a delta-function supported at that point. In this process the agents learn to play A with probability
A special case of zero residuals deserves a discussion. In the limit of zero residuals velocity (for any fixed ) has three fixed points: and When the first, corresponding to the strategy “always A”, is an asymptotically stable node, the second, corresponding to “always B”, is a saddle, and the origin is an unstable node. One can compute that on any trajectory of the velocity field inside the first quadrant,
[TABLE]
Thus, the population average probability to play A, increases to 1, the stable stationary point converges to and the support of moves toward point The agents with memory decay and zero residual levels do learn the optimal strategy. Moreover, because the learning occurs in a bounded region of the stimuli space, the convergence to the equilibrium is faster than in the case of learning with perfect memory Using equation (11) we can also estimate the rate of convergence as a function of The rate is proportional to implying that the characteristic time of convergence is Figure 1 shows the simulations of the stochastic process in this regime for different values of It shows that the prediction based on the model (16) is in good agreement with the stochastic learning process. We conclude that models with low residuals and high perform better in learning the optimal strategy in 2x2 games and are thus more likely to evolve.
3.2 Second order effects
The inclusion of the next-order approximation adds a diffusion term to equation (16) with the diffusion coefficients proportional to In the problems where the drift velocity takes to regions with large values of as in the case of the pure or mixed Nash equilibrium, diffusion will have a marginal effect. In the problems with asymptotically stable fixed points in stimuli space (short memory models) diffusion will create a stationary distribution of near the fixed point, preventing all agents from adopting a single strategy.
The author wishes to thank the anonymous referees for patient reading of the manuscript and detailed comments that helped to improve it in so many ways.
- [1] Benaïm, M. & Hirsch, M.W. (1999). Stochastic approximation algorithms with constant step size whose average is cooperative. The Annals of Applied Probability, Vol. 9, No. 1, 216–241.
- [2] Börgers, T. & Sarin, R. (1997). Learning through reinforcement and replicator dynamics. J. Economic Theory, 77, 1, 1–14.
- [3] Bush, R. R. & Mosteller, F. (1955). Stochastic models for learning. New York: Wiley.
- [4] Catania, A.C. (1963). Concurrent performances. A baseline for the study of reinforcement magnitude. J. Exp. Anal. Behavior, 6, 299–300.
- [5] Chung, S.-H. & Herrnstein, R.J. (1967). Choice and delay of reinforcement. J. Exp. Anal. Behavior, 10, 67–74.
- [6] Cross, J. G. (1983). A theory of adaptive economic behavior. Cambridge: Cambridge University Press.
- [7] Domjan, M. & Burkhard, B. (1984). The principles of learning and behavior. Brooks/Cole Publishing Co. Monterey, California.
- [8] Erev, I. & Roth, A. E. (1998). Predicting how people play games: reinforcement learning in experimental games with unique, mixed strategy equilibrium. American Econ. Review 88, 848–881.
