Reinforcement with Fading Memories
Kuang Xu, Se-Young Yun

TL;DR
This paper analyzes how imperfect memory affects decision-making in stochastic sequential tasks, deriving formulas for steady-state choices and highlighting the importance of the update-to-decay rate ratio.
Contribution
It provides closed-form solutions for the steady-state choice distribution under large memory span and elucidates the critical role of update and decay rates in decision accuracy.
Findings
When update rate exceeds decay rate, the agent nearly always chooses the optimal action.
If decay rate exceeds update rate, choices are proportional to reward rates.
The model offers insights into decision-making with fading memories in stochastic environments.
Kuang Xu
Graduate School of Business
Stanford University
Se-Young Yun
Industrial & Systems Engineering
KAIST
Abstract
We study the effect of imperfect memory on decision making in the context of a stochastic sequential action-reward problem. An agent chooses a sequence of actions which generate discrete rewards at different rates. She is allowed to make new choices at rate $\beta$, while past rewards disappear from her memory at rate $\mu$. We focus on a family of decision rules where the agent makes a new choice by randomly selecting an action with a probability approximately proportional to the amount of past rewards associated with each action in her memory.
We provide closed-form formulae for the agent’s steady-state choice distribution in the regime where the memory span is large ($\mu \to 0$), and show that the agent’s success critically depends on how quickly she updates her choices relative to the speed of memory decay. If $\beta \gg \mu$, the agent almost always chooses the best action, i.e., the one with the highest reward rate. Conversely, if $\beta \ll \mu$, the agent chooses an action with a probability roughly proportional to its reward rate. [Footnote 1: September 2017; revised: September 2019. An extended abstract of this paper appeared in ACM SIGMETRICS, 2018.]
Keywords: memory, stochastic model, queue, Markov process, fluid model, reinforcement.
1 Introduction
Memories serve as a crucial link between the past and future in dynamic decision-making problems, allowing subsequent decisions to draw on earlier experiences. However, they are hardly perfect in reality: humans routinely rely on faulty memories when making choices, and a firm’s time-varying states, such as an active user base that is prone to unpredictable attrition, could sway its strategic focus. This raises the question: to what extent can we make good decisions despite having an imperfect memory? In this paper, we investigate this question in the context of a stochastic action-reward model. While the subject can be approached from many different angles, we will focus on understanding the interplay between the rate of memory decay and the rate of decision updates; the former a measure of the quality of memory, and the latter a proxy of the decision maker’s temporal adaptivity. Our main results demonstrate that the relative magnitude between the two rates can have profound performance implications.
Let us begin with an informal description of our model (Figure 1). An agent operating in continuous time makes choices among a finite menu of actions. When action $k$ is chosen, the agent accrues discrete rewards according to a Poisson process with rate $\lambda_k$, while a unit of reward “expires” and disappears from her memory after a random amount of time that is exponentially distributed with rate $\mu$. Opportunities for the agent to make new choices arise according to a Poisson process with rate $\beta$. We will restrict our attention to a simple family of decision rules that the agent may use, referred to as the reward matching rule, which reinforces actions with more positive past stimuli in a proportional manner: she randomly samples an action such that the probability of choosing action $k$ is proportional to $Q_k \vee \alpha$, where $Q_k$ is the total units of rewards accrued through action $k$ that have yet to expire, and $\alpha$ is an exploration parameter representing the agent’s minimal willingness to experiment with any action. If $\alpha$ is set to 0, then the reward matching rule corresponds to the celebrated Luce’s linear probabilistic choice model (Luce, 1959; Erev and Roth, 1998), which has been extensively studied in behavioral economics (Erev and Roth, 1998; Beggs, 2005) and evolutionary biology (Harley, 1981). We are interested in the probability that the agent chooses a particular action in steady state. We next examine two illustrative examples, which also serve as motivating applications for the more stylized model we study.
Example 1 - Consumer Choice Modeling. We may think of the agent as a consumer, the actions as products or service providers, and the opportunities to choose new actions as the times when the consumer is allowed to re-select her membership or subscription. The rewards act as positive experiences or impressions resulting from using a service, and the strength of each impression in the agent’s memory diminishes as time progresses. When an opportunity to renew the subscription arises, the agent chooses a new service offering in a manner that is biased towards the ones associated with more recallable impressions.
Example 2 - Dynamic Product Offering under Customer Attrition. In this example, we slightly refine the model by letting the departure rates of rewards vary across actions. Consider a firm that operates a number of service contracts (actions), e.g., mobile phone plans, and needs to decide which contract to offer new customers (rewards) in a series of promotion periods, where one contract is offered in each period. Customers arrive to the system and subscribe to the service contract of the corresponding promotion period. Those who subscribe to contract type $k$ remain active customers for an exponentially distributed amount of time with rate $\mu_k$, before departing from the system. The departure rate $\mu_k$ reflects some underlying qualities of contract $k$. The reward matching rule corresponds to the service provider’s choosing, at the end of each promotion period, a new contract with probabilities roughly proportional to the number of active customers associated with each contract.
Motivation for studying the reward matching rule.
An implied objective of our model is that the agent would like to maximize the steady-state total amount of rewards in system. If all the parameters were known a priori, it is not difficult to see that this objective could be trivially achieved by choosing at all times the best action, i.e., one with the highest reward rate (Example 1) or the highest weighted reward rate, (Example 2). The reward matching rule, therefore, can be viewed as an intuitive heuristic for approximately solving this optimization problem in the absence of the knowledge of the parameters. While the rule is by no means the only plausible heuristic for this purpose, we believe that it has a number of advantages which merit a theoretical investigation such as the one carried out in this work:
1. Firstly, Luce’s rule, after which the reward matching rule is modeled, has a long history as a fundamental building block in psychology and behavioral economics (see Section 3 for more details). The reward matching rule hence serves as a natural starting point for understanding other, potentially more complex, cognitive models.
2. Secondly, the reward matching rule is useful not only for modeling human choice behavior. It could also be a policy deliberately chosen by a decision maker, such as in Example 2. While a conscious decision maker could in principle adopt more sophisticated decision rules or even use external information storage to mitigate memory decay, the reward matching rule offers a simple and intuitive rule of thumb that can be substantially easier for a practitioner to understand and implement, and is applicable in settings where external information storage is not possible.
3. Last but not least, our results show that, for a wide range of parameter values, the reward matching rule is in fact near optimal and induces steady-state choice probabilities that concentrate heavily on the best action. Along with its simplicity and conceptual appeal, this near-optimality further justifies studying the reward matching rule.
Preview of main results.
Our main results provide simple, closed-form formulae for the distribution of the agent’s choice in steady state, in the regime where the agent’s memory span is large ($\mu \to 0$). Using these formulae, we show that the agent’s success critically depends on how fast her choices are updated ($\beta$) relative to the rate of memory decay ($\mu$), when $\alpha$ is relatively small compared to the overall memory size. In particular, if $\beta \gg \mu$, the agent nearly always chooses the best action, i.e., the one with the highest reward rate $\lambda_k$. [Footnote 2: The notation $\beta \gg \mu$ represents $\beta/\mu \to \infty$.] In contrast, if $\beta \ll \mu$, the agent splits her attention more smoothly among all actions, with frequencies roughly proportional to the respective $\lambda_k$’s.
1.1 The Model
We now describe our model more formally, as is depicted in Figure 1. The system consists of an agent operating in continuous time, who makes choices by selecting from a set of $K$ actions, $\mathcal{K} = \{1, \ldots, K\}$. We denote by $C(t)$ the action chosen by the agent at time $t$, and refer to $C(\cdot)$ as the choice process. When $C(t) = k$ for some $k \in \mathcal{K}$, the agent accrues discrete rewards according to a Poisson process with rate $\lambda_k$, where each reward can be treated as a token of unit size. We refer to $\lambda_k$ as the reward rate of action $k$. We assume that there exists one reward rate that strictly dominates the rest, and, without loss of generality, we rank them in a decreasing order, so that $\lambda_1 > \lambda_2 \geq \cdots \geq \lambda_K$.
The agent’s fading memory is modeled by the “expiration” of past rewards. Each reward is associated with a lifespan, an exponential random variable with rate $\mu$ drawn independently from all other aspects of the system, and a reward departs permanently after staying in the system for a time duration equal to its lifespan. We refer to $\mu$ as the rate of memory decay. For simplicity of notation, the majority of our exposition and proofs will focus on the case where $\mu$ is uniform across all actions, but our main results extend to the case with action-dependent memory decay rates in a straightforward manner; such a refinement will be discussed in Section 8.3.
For $k \in \mathcal{K}$, we denote by $Q_k(t)$ the total units of rewards accrued from choosing action $k$ that have yet to depart by time $t$, and we refer to $Q(\cdot)$ as the recallable reward process. Note that at time $t$ the recallable rewards associated with action $k$ depart at an aggregate rate of $\mu Q_k(t)$, and arrive at rate $\lambda_k$, if $C(t) = k$, or $0$, otherwise. We assume that the agent’s initial recallable rewards at time $t = 0$ is a bounded vector in $\mathbb{Z}_+^K$.
The agent has the opportunity to make a new choice at a set of update points scattered across time. The update points are distributed according to an exogenous Poisson process with rate $\beta$, which we refer to as the update rate, so that the time between two adjacent update points is an exponentially distributed random variable with mean $1/\beta$, independent from all other aspects of the system. Depending on the application context, $\beta$ could be interpreted as the level of “activeness” or “savviness,” as it modulates the frequency at which the agent updates her decisions according to the present state of her recallable memories (cf. Examples 1 and 2 in the Introduction).
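How does the agent make a new choice when an update point arises? In this work, we will focus on a family of choice heuristics referred to as the reward matching rules. We assume that at time $t$ the agent makes a new choice by sampling from the distribution
$$\mathbb{P}\big(C(t) = k \,\big|\, Q(t)\big) = \frac{Q_k(t) \vee \alpha}{\sum_{i \in \mathcal{K}} \big(Q_i(t) \vee \alpha\big)}, \qquad k \in \mathcal{K}, \tag{1}$$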
where . Here, is a positive constant referred to as the exploration parameter, which captures the agent’s willingness to experiment with an action even if there is currently little or no recallable reward attached to it. The justification and background behind the choice of this decision rule will be further elaborated on in Section 3. The parameter can also have an effect on the trade-off between the system’s steady-state versus transient performance; the reader is referred to Section 8.1.1 for a discussion.
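The model described in this subsection can be explored numerically. The following is a minimal simulation sketch, not the authors' code: it treats reward arrivals, reward expirations, and update points as competing exponential clocks. The function name and the parameter names `lam` (reward rates), `mu` (memory decay rate), `beta` (update rate), and `alpha` (exploration parameter) are illustrative assumptions, as is the choice to weight action $k$ by the maximum of its recallable reward count and `alpha` at each update point.

```python
import random


def simulate_reward_matching(lam, mu, beta, alpha, horizon, seed=0):
    """Simulate the joint (recallable reward, choice) process on [0, horizon].

    Returns the fraction of time spent on each action.
    Assumption: at an update point, action k is drawn with weight max(q[k], alpha).
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    rng = random.Random(seed)
    num_actions = len(lam)
    q = [0] * num_actions                 # recallable rewards per action
    choice = rng.randrange(num_actions)   # arbitrary initial action
    time_on = [0.0] * num_actions
    t = 0.0
    while True:
        arrival_rate = lam[choice]        # rewards arrive only for the current choice
        expiry_rate = mu * sum(q)         # each stored reward expires at rate mu
        total_rate = arrival_rate + expiry_rate + beta
        dt = rng.expovariate(total_rate)  # time until the next event of any kind
        if t + dt >= horizon:
            time_on[choice] += horizon - t
            break
        time_on[choice] += dt
        t += dt
        u = rng.random() * total_rate     # pick which clock fired
        if u < arrival_rate:
            q[choice] += 1
        elif u < arrival_rate + expiry_rate:
            # A uniformly chosen stored reward expires: action k w.p. q[k] / sum(q).
            k = rng.choices(range(num_actions), weights=q)[0]
            q[k] -= 1
        else:
            # Update point: sample a new choice by reward matching.
            weights = [max(qk, alpha) for qk in q]
            choice = rng.choices(range(num_actions), weights=weights)[0]
    return [x / horizon for x in time_on]
```

For instance, `simulate_reward_matching([2.0, 1.0], mu=0.05, beta=1.0, alpha=1.0, horizon=200.0)` returns the empirical time fractions for two actions over a short horizon; longer horizons are needed before these fractions can be compared with steady-state predictions.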
1.2 Performance Metrics and Scaling Regime
The main goal of this paper is to study the agent’s long-run behavior in the model described above, and a key quantity of interest is the steady-state distribution of the choice process, , i.e.,
[TABLE]
Unfortunately, the choice process is non-Markovian as its transitions are influenced by the recallable reward process in a non-trivial way, and this makes it difficult to obtain explicit expressions for its steady-state distribution. To circumvent this challenge, we will focus on the regime where the average lifespan of a reward, , tends to infinity, as follows. Fix the action set, , and the reward rates, . We consider a sequence of systems, indexed by , where the average lifespan of a reward in the th system is equal to , i.e.,
[TABLE]
We will denote by , , , and the corresponding quantities in the th system. Finally, as the rewards’ average lifespan, , tends to infinity, the total amount of recallable rewards in the system also scales as . [Footnote 3: To see this, note that if the choice process were to stay in action $k$ all the time, then $Q_k(\cdot)$ would evolve according to an $M/M/\infty$ queue with arrival rate $\lambda_k$ and departure rate $\mu$, leading to a steady-state expected total recallable reward of $\lambda_k/\mu$.] For this reason, we will scale the exploration parameter linearly with respect to , by setting , where is a fixed parameter.
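The scaling argument in the footnote can be checked with a short calculation. If the agent never switches away from an action with reward rate `lam`, the expected recallable reward satisfies the standard infinite-server relation $\frac{d}{dt}\mathbb{E}[Q(t)] = \lambda - \mu\,\mathbb{E}[Q(t)]$, whose solution is sketched below. The function name and arguments are illustrative, not taken from the paper.

```python
import math


def mean_recallable(t, lam, mu, q0=0.0):
    """Expected recallable reward at time t for an action that is chosen forever.

    Solves dE[Q]/dt = lam - mu * E[Q] with E[Q(0)] = q0 (infinite-server queue).
    """
    steady_state = lam / mu
    return steady_state + (q0 - steady_state) * math.exp(-mu * t)
```

As `t` grows, the value approaches `lam / mu`, so halving the decay rate doubles the steady-state memory content; this is why an exploration parameter that is to remain comparable to the memory content has to grow linearly with the average lifespan.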
Remark. It is natural to ask whether by taking a sequence of systems where the memory decay rate approaches zero, we will end up with the trivial case where there is no memory loss at all ($\mu = 0$), which would have been at odds with our goal of studying imperfect memories. This is not the case. Because we always look at the steady state of the system by taking the limit as $t \to \infty$ for every system in the sequence, the limiting regime is non-trivial and is not equivalent to simply setting $\mu = 0$, as is evidenced by the existence of two distinct limiting steady-state distributions in our main theorems. (This point is further elaborated on in Appendix A.) The limit of $\mu \to 0$ is intended as an approximation for a pre-limit system where $\mu$ is positive but small. Numerical results in Figures 4 and 5 suggest that our theorems provide a reasonably good approximation for as small as .
1.3 Notation
For , and denote and , respectively. For a vector , denotes the norm of , .
We use the notation and as a shorthand for processes and , respectively. For two random variables and taking values in , we use to mean that and have the same distribution. The expression means is stochastically dominated by , i.e., for all . For an event , we use to denote the random variable with the cumulative distribution function \mathbb{P}(X\leq x\,\big{|}\,\mathcal{A}), . For a stochastic process, , we denote by the random variable distributed according to the stationary distribution of , if it exists.
For two functions and , means that ; is defined analogously. When such a limiting notation is involved, the limit is assumed to be as the argument tends to , unless explicitly specified otherwise.
2 Main Results
We now present our main results, which provide exact expressions for the steady-state distribution of the choice process, as well as the expected value of the (scaled) recallable reward process, in the limit as . For the remainder of the paper, we will assume that the update rate satisfies either or , as .
Theorem 2.1 (Steady-State Choice Probabilities)
Fix . The following limits exist
[TABLE]
Furthermore, their expressions depend on the scaling of , as follows.
1. (Memory-abundant regime) Suppose that , as . If , then
[TABLE]
If , then
[TABLE]
2. (Memory-deficient regime) Suppose that , as . Then,
[TABLE]
The next theorem characterizes the steady-state expected value of the recallable reward process, . For , define the scaled recallable reward process:
[TABLE]
Theorem 2.2 (Steady-State Recallable Rewards)
Fix . For all , the scaled recallable reward process, , admits a unique steady-state distribution, and the following limits exist:
[TABLE]
Furthermore, their expressions depend on the scaling of , as follows.
1. (Memory-abundant regime) Suppose that , as . If , then
[TABLE]
If , then
[TABLE]
2. (Memory-deficient regime) Suppose that , as . Then,
[TABLE]
2.1 Implications of the Theorems
A prominent feature of Theorems 2.1 and 2.2 is that the results critically depend on how quickly the update rate scales relative to the rate of memory decay , but are otherwise insensitive to the exact expression of . This leads to a natural dichotomy of the system dynamics into what we called the memory-abundant and memory-deficient regimes, and we discuss below several notable properties of the two regimes.
Near optimality and winner-takes-all.
Suppose that the constant in the exploration parameter is substantially smaller than . Then, our results show that the best action attracts nearly all of the agent’s attention, if (memory-abundant regime). Specifically, in the limit as , the agent chooses action with a probability of , which is almost when is very small compared to . Recall from Section 1 that the optimal strategy for the agent, should she know all parameters of the problem, would be to choose action at all times. Therefore, in this regime, the reward matching rule is able to deliver near optimal performance.
Moreover, the system exhibits an interesting winner-takes-all phenomenon, whereby the degree to which the agent focuses on the best action is independent of the reward rates of all other actions. For instance, if we were to increase the reward rate of the second-best action, , the probability that the agent chooses action would stay unchanged (and close to 1), and that of choosing action would still be close to zero no matter how close is to . One consequence of this effect is that, if we imagine the actions as being “competitive” service providers vying for a consumer’s attention, then the winner-takes-all phenomenon can be interpreted as a form of extreme competition among the providers, where the best attracts nearly all of the consumer’s budget, even if its superiority compared to the second best is not significant.
Reward-rate-proportional allocation.
In contrast to the memory-abundant regime, an update rate that scales substantially slower than the memory decay rate will lead to an allocation of choices that is much smoother as a function of the reward rates (Figure 3 provides a graphical comparison of the qualitative difference between the agent’s behavior under the two regimes). In this case, the probability that the agent chooses action is proportional to , which, when is small, is essentially equal to the reward rate, . The agent’s behavior thus exhibits a certain weak reinforcement, where the better actions do attract more attention but only proportional to their respective reward rates, and the above-mentioned winner-takes-all effect disappears.
The intuition behind this reward-rate-proportional allocation can be roughly explained as follows. When the update rate is slow, by the time a new choice is to be made the agent would have forgotten the rewards associated with all actions other than her most recent choice. Nevertheless, she still exhibits a mild preference over actions with higher reward rates in steady state, because she does retain a good memory of the rewards of her most recent choice, , which can be shown to be proportional to . The reward rate dictates how likely the agent will make the same choices, which in turn leads to the reward rates’ approximately linear appearance in the steady-state distribution of the choice process.
Rapid updates lead to complete oblivion.
The theorems also show that, perhaps surprisingly, a slower update rate is not always bad. While it may be tempting to conclude that faster update rates always lead to better outcomes for the agent, this turns out to be true only when the exploration parameter is relatively small. In fact, as soon as grows beyond , in the memory-abundant regime, the agent becomes completely oblivious and chooses actions uniformly at random (Eq. (6)). In contrast, the agent is always biased towards the best action in the memory-deficient regime, albeit mildly, as long as (beyond which point the exploration parameter is so large that it trivially overwhelms even the best action). The true culprit behind the complete oblivion is a lack of patience: when the update rate is fast and the exploration parameter high, the agent tends to switch her choices rapidly, before she is able to collect enough rewards to “learn” the reward rate of any individual action. Consequently, in the long run, not a single component of the recallable reward, not even the best action, can distinguish itself by consistently rising above the exploration parameter, and hence complete oblivion becomes the only possible outcome.
Remark 2.3
Note that Theorems 2.1 and 2.2 have left out the regime where and scale at the same rate, i.e., when as . This has turned out to be a difficult regime to tackle, and an intuitive explanation for this difficulty is as follows. Our current proof technique exploits the fact that one of the two processes in and becomes asymptotically Markovian as , which significantly simplifies the dynamics (see Section 4). Unfortunately, this is no longer the case when , as neither of the two processes and becomes asymptotically Markovian by itself, and the stochastic dynamics of the system remain complex despite taking the limit of . We may illustrate this phenomenon by the following back-of-the-envelope calculation: the probability that any individual unit of reward will depart from the system between two consecutive update points is on the order of . When , the number of reward departures between two update points forms a constant fraction of existing rewards, and the same is true for the number of new rewards arriving during this period. Therefore, the changes in the recallable reward process remain highly variable from one update point to the next, and they depend both on the process itself and the choice process, . In contrast, in the memory-abundant or memory-deficient regimes, the profile of the recallable rewards either changes very little between two update points (memory-abundant regime), or substantially but in a predictable manner (memory-deficient regime).
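The back-of-the-envelope calculation in this remark can be made concrete under the model's exponential assumptions. For a unit of reward that is present at an update point, its remaining lifespan is exponential with the decay rate and the time to the next update point is exponential with the update rate, so the chance that it expires first is a ratio of competing rates. The sketch below uses the illustrative names `mu` and `beta` for the two rates; it is an exact expression for this competing-clocks probability, offered as an illustration rather than as the order-of-magnitude expression used in the text.

```python
def prob_expire_before_update(mu, beta):
    """Probability that an Exp(mu) lifespan ends before an Exp(beta) update interval."""
    return mu / (mu + beta)


# Three illustrative regimes (values chosen only for demonstration).
abundant = prob_expire_before_update(mu=0.01, beta=10.0)    # fast updates: close to 0
balanced = prob_expire_before_update(mu=1.0, beta=1.0)      # same order: constant fraction
deficient = prob_expire_before_update(mu=10.0, beta=0.01)   # slow updates: close to 1
```

With fast updates almost nothing is forgotten between update points, with slow updates almost everything is, and only when the two rates are of the same order does a non-degenerate fraction of the memory turn over between consecutive decisions.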
2.2 Numerical Results
While Theorems 2.1 and 2.2 only apply in the limit where the average reward lifespan tends to infinity, the simulation results in Figures 4 and 5 show that our theoretical results also provide fairly accurate predictions, quantitatively and qualitatively, in systems with a moderate, finite lifespan. In Figure 4, we fix , while varying the update rate, , and we observe that concentration on the best action intensifies as increases. Figure 5 shows example sample paths of the scaled recallable reward processes, . One can discern without difficulty that, even for a modest , the process in the memory-abundant regime (plots through ) stays close to a set of smooth “fluid” trajectories, whose invariant state (the flat portion of the trajectories) coincide with the limiting value of predicted in Eq. (10), while in the memory-deficient regime the process changes more abruptly over time. As will become clear in the sequel, this qualitative discrepancy is a direct consequence of the overall system dynamics being dominated by the evolution of either or , depending on the scaling of the update rate , a key fact that we will exploit in our proof of the main theorems.
3 Related Literature
The assumption of a reward’s lifespan being exponentially distributed is modeled after the exponential memory decay theory of Ebbinghaus (Ebbinghaus, (1964)), which posits that the recall probability of a past event decays exponentially as time passes. The reward matching decision rule (Eq. (1)) can be viewed as a generalization of Luce’s linear probabilistic choice rule (Luce, (1959)), originally created to model empirical observations where human or animal subjects make choices with probabilities proportional to the associated amount of past rewards or stimuli. Luce’s rule has been studied extensively in a variety of disciplines, ranging across cognitive science (Herrnstein, (1970)), behavioral economics (Erev and Roth, (1998)), and evolutionary biology (Harley, (1981)); more recently, in a different context, a similar proportional-sampling inference algorithm closely related to Luce’s rule has been used in establishing the sample complexity of Bayesian active learning with privacy constraints (Xu, (2018)). Notably, the celebrated work of Erev and Roth, (1998) demonstrates that a discrete-time version of Luce’s rule can be a powerful reinforcement learning model for explaining a player’s behavior in sequential games, and subsequent theoretical analysis showed that the empirical frequency of the best action under Luce’s rule converges to one in single-player games (Rustichini, (1999), Beggs, (2005), and more recently, Mertikopoulos and Sandholm, (2016)). These results, however, assume the agent has perfect memory, while models involving memory decay have received relatively little attention. Beggs, (2005) analyzes a variation of the model involving “forgetting” as a means to speed up convergence to mixed equilibria, with a crucial feature that the forgetting be performed in a noiseless fashion by weighing distant past experience with a deterministic discount factor.
As such, among other differences, the stochastic memory decay model that we adopt exhibits fundamentally different dynamics than the deterministic discount model; for instance, under our model there is a non-negligible probability that all past rewards could be entirely erased, while the deterministic version never truly forgets the past, no matter how distant.
The impact of limited memory on sequential decision-making has been examined in the statistics literature dating back to the seminal work of Cover and Hellman, (1970), Hellman and Cover, (1970), which proposed algorithms for solving multi-armed bandit and sequential hypothesis testing problems when the decision maker has access to only a finite number of bits of memory. In the queueing theory literature, Mitzenmacher et al., (2002) and Gamarnik et al., (2016) study load balancing problems with space-bounded routing algorithms. In contrast to the present paper, the memories in these models are assumed to be perfect and immune to random erasures.
The dynamic induced in our model is related to the celebrated Pólya’s urn scheme (cf. Pemantle, (2007)). In its basic form, one ball is drawn uniformly at random from an urn containing balls of different colors. The chosen ball is then returned to the urn along with a new ball of the same color, and the procedure repeats. Since the probability of a certain color being chosen is proportional to the number of balls of that color present in the urn, by viewing the balls as rewards and colors as actions, our model can be thought of as the urn scheme with two modifications: (a) each ball disappears from the urn after some random amount of time, corresponding to memory loss, and (b) the number of new balls added is random, with a distribution that depends on the color, corresponding to the variations in the reward rates. Note that in Pólya’s urn scheme the total number of balls in the urn eventually tends to infinity as the number of draws increases, while in our model the recallable rewards always stay a finite random variable as a result of memory loss. As such, our model can be viewed as a stationary variation of the urn scheme, which admits a very different dynamic.
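For comparison, the basic urn scheme described above can be sketched in a few lines. This is an illustration of the classical scheme only, with no ball removal and exactly one ball added per draw; the function name and arguments are illustrative.

```python
import random


def polya_urn(initial_counts, draws, seed=0):
    """Run the basic Polya urn: draw a ball, return it with one more of its color."""
    rng = random.Random(seed)
    counts = list(initial_counts)
    for _ in range(draws):
        # A uniformly drawn ball has color c with probability counts[c] / sum(counts).
        color = rng.choices(range(len(counts)), weights=counts)[0]
        counts[color] += 1
    return counts
```

The total number of balls grows by one per draw and never shrinks, which is the contrast drawn in the text: with random ball removal the urn content would instead remain finite in the long run.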
On the technical end, our analysis of the memory-abundant regime relies on a certain fluid model to approximate the evolution of the original, stochastic recallable reward process. While the use of fluid models is a popular technique in the literature of queueing networks (cf. Kurtz, 1970, Bramson, 1998, Tsitsiklis and Xu, 2012, Massoulié and Xu, 2018), our work departs from conventional fluid models in a notable way: the recallable reward process is not Markovian, and the analysis hence must take into account the dynamics of the auxiliary choice process. It is worth noting that there has been a large-deviation theory developed for obtaining fluid-like approximation results where the transition rates of a continuous-time jump process are modulated by a finite-state auxiliary chain (cf. Theorem 1 of Mitzenmacher et al., (2002), which is based on Theorem 8.15 of Shwartz and Weiss, (1995)). Unfortunately, these results do not apply in our model for two reasons. First, they require the transition rate of the jump process of interest to be bounded, a condition not met here due to the aggregate departure rate of the rewards being unbounded over the state space. Second, the existing results typically have one scaling parameter, , whereby time is sped up by a factor of , and space scaled down by a factor of . We use a similar scaling for the recallable reward process, , while the update rate serves as a second scaling parameter that is not captured by the model in Mitzenmacher et al., (2002), Shwartz and Weiss, (1995). The addition of is non-trivial: in fact, if is too small, the fluid approximation does not hold, and a different (memory-deficient) regime arises. The dynamics of our model in the memory-abundant regime are also related to the so-called averaging principle in establishing fluid approximation for Markov processes, where the evolution of a Markov process at a slow time scale is influenced by average behavior of a modulating process at a faster time scale (cf. 
Perry and Whitt, (2011), Mandelbaum et al., (1998), Evdokimova et al., (2018), Coutin et al., (2010); see also Darling and Norris, (2008) for an overview.) However, in these models the underlying slow process is Markovian (e.g., queue lengths) and the modulating process a function of its value (e.g., the differences in queue lengths in Perry and Whitt, (2011)). In contrast, the recallable reward process in our model is not Markovian due to the existence of the choice process. The non-Markovian nature of our model means that certain martingale properties typically used to establish convergence do not directly follow from existing results. As a result of these aforementioned differences, we will develop our fluid approximation results starting from first principles using arguments developed by Kurtz, (1978), involving elementary properties of continuous-time martingales and certain quasi-fluid processes.
4 Proof Overview
The remainder of the paper is devoted to the proof of Theorem 2.2; the proof of Theorem 2.1 is a straightforward extension of Theorem 2.2 and will be given in Section 7. We discuss in this section some high-level ideas before delving into the details. Our proof consists of two main components corresponding to the memory-abundant and memory-deficient regimes, respectively. The system dynamics turn out to be quite different depending on the scaling of , and as a result, so are our proof techniques. Fix , and define the joint reward-choice process
[TABLE]
It is not difficult to check that the joint process is Markovian, while, for any fixed , the component processes and depend on one another and are not individually Markovian (see Figure 2). However, a key observation is that one of the two component processes becomes asymptotically Markovian as :
1. (Memory-abundant) If as , a large number of choice updates takes place before substantial change occurs in the recallable reward process. As a result, the recallable reward process, , becomes a sufficient Markov representation of the system dynamics, as .
2. (Memory-deficient) If as , the interval between two consecutive update points is large. By the time a new choice is to be made, the agent will have “forgotten” nearly all of the rewards associated with the actions other than the most current choice, and the rewards associated with the current choice will have become quite predictable. Here, the choice process becomes asymptotically Markovian as .
Our proofs for the two regimes will be tailored to the dichotomy outlined above.
1. For the memory-abundant regime, our analysis focuses on characterizing the scaled recallable reward process, , and uses a certain fluid model to show that, as , the evolution of is well approximated by the solution to a system of ordinary differential equations (ODEs), and its stationary distribution converges to the unique invariant state of the ODEs. This portion of the proof will be presented in Section 5.
2. For the memory-deficient regime, we turn instead to the choice process, , and use its asymptotic Markovian property to obtain an explicit formula for its steady-state distribution, in the limit as . We then leverage this formula to calculate the expected steady-state recallable reward using the stationarity properties of the continuous-time processes. The proof for this portion of the theorem is presented in Section 6.
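The dichotomy between the two regimes can be explored numerically with a small event-driven simulation of the joint reward-choice process. The sketch below is illustrative only and rests on assumptions that are not taken from the equations of this paper: the names `lam` (reward rates), `mu` (per-reward forgetting rate), `beta` (update rate) and `alpha` (a floor on the choice weights) are hypothetical, and the choice rule is assumed to select action k with probability proportional to max(Q_k, alpha).

```python
import random

def simulate(lam, mu, beta, alpha, horizon, seed=0):
    """Event-driven simulation of an ASSUMED reward-choice model.

    Assumptions (hypothetical, for illustration only):
      - the current action c earns unit rewards at Poisson rate lam[c];
      - each remembered reward is forgotten independently at rate mu;
      - the choice is redrawn at Poisson rate beta, with probability
        proportional to max(Q[k], alpha).
    Returns (fraction of time spent on each action, final reward counts).
    """
    rng = random.Random(seed)
    K = len(lam)
    Q = [0] * K                  # remembered rewards per action
    c = rng.randrange(K)         # current choice
    t = 0.0
    time_on = [0.0] * K
    while True:
        arrival, departure, update = lam[c], mu * sum(Q), beta
        total = arrival + departure + update
        dt = rng.expovariate(total)
        if t + dt >= horizon:
            time_on[c] += horizon - t
            break
        time_on[c] += dt
        t += dt
        u = rng.random() * total
        if u < arrival:
            Q[c] += 1            # new reward for the current action
        elif u < arrival + departure:
            # one remembered reward is forgotten, chosen uniformly
            k = rng.choices(range(K), weights=Q)[0]
            Q[k] -= 1
        else:
            # update point: redraw the choice from the assumed rule
            weights = [max(q, alpha) for q in Q]
            c = rng.choices(range(K), weights=weights)[0]
    return [x / horizon for x in time_on], Q
```

Running `simulate` once with `beta` much larger than `mu` and once with `beta` much smaller than `mu` gives a rough empirical view of how the time fractions differ between the two regimes; the output is random and depends on the seed.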
5 The Memory-abundant Regime
We establish the first claim of Theorem 2.2 in this section, and we will do so by actually proving a stronger statement, that the steady-state distribution of converges in to the corresponding expressions in Eqs. (10) and (11), as . As was alluded to earlier, our main tool relies on certain fluid solutions, defined as the solutions to a set of ordinary differential equations, whose dynamics will be shown to closely approximate that of when is large. We begin by introducing the fluid solutions and some of their basic properties.
5.1 Fluid Solutions and Their Basic Properties
Define the function , where
[TABLE]
Definition 5.1 (Fluid Solutions)
Fix . A continuous function, , is called a fluid solution with initial condition , if it satisfies the following:
1. ;
2. for almost all , the function is differentiable along all coordinates, with
[TABLE]
where is defined in Eq. (14).
We say that is an invariant state of the fluid solutions, if by setting , we have that for all .
The first term on the right-hand side of Eq. (15) corresponds to the instantaneous arrival rate of the rewards, where is the probability that action is chosen given the current states of recallable rewards, and the reward rate for that action. The second term corresponds to the instantaneous departure rate of rewards, which is proportional to the amount of recallable rewards associated with action . The following lemma states some basic properties of the fluid solutions. The proof is given in Appendix D.1.
Lemma 5.2
For all , there exists a unique fluid solution with initial condition . Furthermore, for all , is a continuous function with respect to .
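To make the fluid dynamics concrete, the following sketch integrates an ODE of the form described above (arrival term equal to the reward rate times the choice probability, minus a departure term equal to the current recallable reward) with a forward-Euler scheme. The specific choice probability used here, weights max(q_k, alpha) normalised to sum to one, is an assumption made for illustration, as are the names `lam`, `alpha` and the unit departure rate.

```python
import numpy as np

def choice_prob(q, alpha):
    """ASSUMED choice rule: weights max(q_k, alpha), normalised."""
    w = np.maximum(q, alpha)
    return w / w.sum()

def drift(q, lam, alpha):
    """Arrival term lam_k * p_k(q) minus departure term q_k."""
    return lam * choice_prob(q, alpha) - q

def fluid_solution(q0, lam, alpha, horizon, dt=1e-2):
    """Forward-Euler approximation of dq/dt = drift(q) on [0, horizon]."""
    q = np.array(q0, dtype=float)
    lam = np.asarray(lam, dtype=float)
    path = [q.copy()]
    for _ in range(int(round(horizon / dt))):
        q = q + dt * drift(q, lam, alpha)
        path.append(q.copy())
    return np.array(path)
```

The returned array holds one row per time step, so individual coordinates can be plotted against time to inspect how the trajectory approaches its rest point.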
The next result states that the fluid solutions admit a unique invariant state. Note that the expressions of the invariant state coincide with those in Eqs. (10) and (11) in Theorem 2.2.
Theorem 5.3
Fix . The fluid solutions admit a unique invariant state, , whose expressions depend on the value of , as follows.
1. Suppose . Then,
[TABLE]
2. Suppose . Then,
[TABLE]
Proof. It can be easily verified that the expressions in Eqs. (18) and (19) are valid invariant states in their respective regimes of values. We next show that they are indeed the unique invariant states in both cases. Let be an invariant state of the fluid solutions. Then, must satisfy:
[TABLE]
Define a partition of into and , where:
[TABLE]
Note that by definition,
[TABLE]
Suppose that . We next show that in this case is the singleton set, . First, suppose that is empty, that is, for all . We have that
[TABLE]
where the last inequality follows from the assumption that . This leads to a contradiction, implying that . Fix . Combining Eqs. (20), (22) and (23), we have that
[TABLE]
which, after rearrangement, can be written as
[TABLE]
Note that in Eq. (26) only the term on the right-hand side depends on . This implies that, if , then all coordinates in must have the same arrival rate of rewards. Thus, we can assume that there exists , such that , for all , and it suffices to show that .
For the sake of contradiction, suppose that , and hence . This implies that , and hence . Invoking Eq. (20) again, we have that
[TABLE]
where step follows from Eq. (26), and from the fact that . Clearly, Eq. (27) contradicts the fact that . We thus conclude that , , and . By setting in Eq. (26), we further conclude that
[TABLE]
Now, fix . We have, from Eq. (20), that
[TABLE]
where the last equality follows from the fact that
[TABLE]
Combining Eqs. (28) and (29), we have shown that the expressions in Eq. (18) are the unique invariant state of the fluid model, when .
We turn next to the second case of Theorem 5.3, and assume that . First, suppose that . By the arguments leading to Eq. (27), which do not depend on the value of , we know that , and
[TABLE]
where both inequalities follow from the assumption that . This contradicts the assumption that and hence , and we conclude that must be empty. This implies that for all , and by Eq. (20), we have that
[TABLE]
This completes the proof of Theorem 5.3.
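As a sanity check on the two-case structure of the invariant state, the sketch below computes a fixed point in closed form and verifies numerically that the drift vanishes there. It is not a transcription of Eqs. (18) and (19): it assumes the choice rule p_k(q) = max(q_k, alpha) / sum_i max(q_i, alpha), a unit departure rate, and that the reward rates are ordered with `lam[0]` strictly largest. The threshold `lam[0] / K` is what a direct fixed-point calculation gives under those assumptions.

```python
import numpy as np

def invariant_state(lam, alpha):
    """Fixed point of dq_k/dt = lam_k * p_k(q) - q_k under the ASSUMED
    rule p_k(q) = max(q_k, alpha) / sum_i max(q_i, alpha).

    lam[0] is assumed to be the strictly largest reward rate.
    """
    lam = np.asarray(lam, dtype=float)
    K = len(lam)
    if alpha < lam[0] / K:
        # only the best action sits above the floor alpha
        q = alpha * lam / lam[0]
        q[0] = lam[0] - (K - 1) * alpha
    else:
        # every coordinate sits at or below the floor alpha
        q = lam / K
    return q

def residual(q, lam, alpha):
    """Drift evaluated at q; it should vanish at an invariant state."""
    w = np.maximum(q, alpha)
    return np.asarray(lam, dtype=float) * w / w.sum() - q
```

In the first branch the total weight equals `lam[0]`, which is why the best action's coordinate absorbs the remainder; in the second branch all weights equal `alpha` and choices are uniform.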
The following theorem is the main result of this section, which would imply the first claim of Theorem 2.2.
Theorem 5.4
Fix . The process admits a unique steady-state distribution, denoted by . Suppose that as . We have that
[TABLE]
where is the unique invariant state of the fluid model, and is the probability measure with unit mass on . Furthermore, we have that
[TABLE]
The remainder of Section 5 is devoted to showing Theorem 5.4 and consists of three main parts, as illustrated in Figure 6.
1. Theorem 5.5 of Section 5.2 shows that the scaled recallable-reward process converges uniformly to a fluid solution over any finite time horizon, as . This is the more technical portion of the proof, and the key idea is to use a quasi-fluid process, one where the evolution of the reward process is deterministic while the choice process remains random, as an intermediary to bridge the gap between the fluid solution and the original stochastic process.
2. Theorem 5.8 of Section 5.3 shows that starting from any bounded initial condition, , the fluid solution converges exponentially fast to the unique invariant state, as . The proof is centered around the evolution of a simple piece-wise linear potential function, which ensures that rapidly enters a suitable subset of the state space, from which point exponential convergence takes place.
3. Finally, in Section 5.4, by combining Theorems 5.5 and 5.8, we show that the sequence of steady-state scaled recallable rewards converges in to a unique limiting point concentrated on the invariant state of the fluid solution, completing the proof of Theorem 5.4.
5.2 Convergence to the Fluid Solution over Finite Horizon
In this subsection, we show that, with high probability, the scaled recallable reward process converges to the fluid solution uniformly over any finite horizon, as , as summarized in the following theorem. (See plots through of Figure 5 for a sample-path example of how this convergence takes place.)
Theorem 5.5
Suppose that , as . Fix , and suppose that
[TABLE]
Let be the unique fluid solution with initial condition , as defined in Lemma 5.2. We have that, for all ,
[TABLE]
A main challenge in proving Theorem 5.5 is that the processes and interact in an intricate way: the recallable reward vector influences how the subsequent choices are made, while the current choice in turn dictates the arrival rate of new rewards associated with each action. The main idea of the proof is to “disentangle” their interactions by means of an intermediate, quasi-fluid process, where the dynamics of the arrivals and departures of rewards become deterministic and “fluid-like”, while the choice process remains random.
We now construct the quasi-fluid process. Fix and , and define
[TABLE]
In words, represents the rate of change in recallable rewards associated with action when the current recallable reward vector and action are and , respectively. Define the scaled action process:
[TABLE]
Note that, unlike , the scaling of only occurs in time and not in space. We then construct a quasi-fluid solution, , by letting
[TABLE]
In words, corresponds to a system in which recallable rewards are governed by deterministic arrival and departure processes, as specified by Eq. (37), and where the choice process is scaled in time by a factor of , but remains stochastic.
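The construction can be illustrated by integrating deterministic reward dynamics along a given piecewise-constant choice path. The sketch below assumes a rate of change of the form lam_k * 1{c = k} - q_k (arrivals only for the current action, unit departure rate per unit of reward); this form, and the names `lam`, `switch_times` and `choices`, are assumptions made for illustration rather than a transcription of Eq. (37). The choice path is taken as an input, so it may come from any stochastic source.

```python
import math

def quasi_fluid(q0, lam, switch_times, choices, t_end):
    """Integrate dq_k/dt = lam[k] * 1{c(t) = k} - q_k along a given
    piecewise-constant choice path c(t) (ASSUMED form of the dynamics).

    choices[i] is held on [switch_times[i], switch_times[i + 1]);
    switch_times must be increasing and start at 0.0.
    """
    q = list(q0)
    times = list(switch_times) + [t_end]
    for i, c in enumerate(choices):
        start, stop = times[i], min(times[i + 1], t_end)
        if stop <= start:
            break
        decay = math.exp(-(stop - start))
        for k in range(len(q)):
            target = lam[k] if k == c else 0.0
            # exact solution of the linear ODE on this interval
            q[k] = target + (q[k] - target) * decay
    return q
```

Because the dynamics are linear between switches, each interval is solved exactly, and every coordinate stays between 0 and max(q0[k], lam[k]).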
We note some basic properties of that will become useful. First, since is a Markov jump process and the function uniformly Lipschitz continuous, analogously to Lemma 5.2, it is not difficult to show that , expressed as a solution to the above integral equation, is uniquely defined almost surely. Also, it is not difficult to verify that
[TABLE]
where the right-hand side of the second inequality corresponds to the scenario where for all and the negative drift term is absent from the expression of .
The proof of Theorem 5.5 proceeds in two main steps. We first show that converges to the quasi-fluid process, , as , as is summarized in the following proposition.
Proposition 5.6
Fix , and suppose that
[TABLE]
Then,
[TABLE]
We then show that, when , the process converges to the fluid solution, :
Proposition 5.7
Suppose, in addition to the condition of Eq. (41) in Proposition 5.6, that , as . For all ,
[TABLE]
Note, in particular, that Proposition 5.6 holds independently of the scaling of , while Proposition 5.7 requires that the system be in the memory-abundant regime. Together, Propositions 5.6 and 5.7 imply Theorem 5.5.
The proofs of Propositions 5.6 and 5.7 are given in Appendices C.1 and C.2, respectively. The main rationale for carrying out the proof of Theorem 5.5 in two steps is as follows. In Proposition 5.6, by fixing an arbitrary sample path of the choice process, , the arrivals and departures from become locally Poisson with a rate modulated by the current value of , thus allowing us to establish a martingale property that will lead to the concentration of around the quasi-fluid solution (Eq. (39)). With Proposition 5.6, the only remaining randomness in the quasi-fluid solution is in the choice process alone, and in Proposition 5.7 we show that its limiting behavior can be captured by a purely deterministic fluid solution in the memory-abundant regime. This two-step procedure is to be contrasted with analyzing the discrepancy between and the fluid solution directly, a more often adopted approach in the literature: in our model, the non-Markovian nature of implies that such differences do not immediately form a martingale.
Remark. We believe that there could be an alternative approach to establishing Theorem 5.5 by analyzing the limiting behavior of a discrete-time embedded Markov chain of the reward process using stochastic approximation theory, and yet there are non-trivial difficulties along this direction. In particular, let be the th update point (when the choice process can be updated), then the embedded reward process can be shown to be a Markov chain, even though the original continuous-time process is not Markovian. One may then hope to invoke stochastic approximation theory to demonstrate that the scaled version of converges to a fluid solution, such as the approach taken in Beggs, (2005). There are however several difficulties. First, the jumps in can be unbounded due to the Poisson reward structure, while most standard results in the stochastic approximation literature require bounded updates between adjacent steps (e.g., Theorem 1 of Benveniste et al., (2012)); it is likely possible to remove this requirement but it will require a more delicate argument. Second, even assuming such convergence has been established, it would only apply to the update points themselves, and to obtain the type of uniform convergence over an entire continuous time interval in the form of Theorem 5.5 would require additional procedures to bound the maximal fluctuation of the reward process between event points. In light of these observations, we tend to believe that the approach adopted in this paper is more direct and easier to carry out.
5.3 Exponential Convergence of Fluid Solutions to Invariant State
In this section, we show that the fluid solution converges exponentially fast to the unique invariant state, starting from any bounded set of initial conditions.
Theorem 5.8
Fix . There exist , such that for all , ,
[TABLE]
Main idea. The main difficulty in the proof of Theorem 5.8 is that, depending on the initial condition, the individual coordinates of may not converge to their equilibrium state monotonically, as can be seen in the trajectory of in the first three plots of Figure 5. The main idea behind our proof is to obtain a more accurate description of with the aid of the following piece-wise linear potential function:
[TABLE]
By analyzing the evolution of , our proof shows that: (1) for every bounded initial condition of , the fluid solution rapidly enters a desirable subset of the state space, and (2) starting from this subset exponential convergence to the invariant state takes place.
While potential functions are often used to establish convergence results, to the best of our knowledge, the use of the potential function in Eq. (44) is new. This is largely due to the fact that our potential function is specifically designed to mirror the total “weights” across all actions under the reward matching rule. In addition, our use of the potential function also diverges from its conventional role in proving convergence results: we do not use directly for characterizing the rate of convergence; the primary role of is for controlling the position of the trajectory in the state space. Once we are assured that the trajectory has entered a suitable subset of the state space, the rate of convergence can be more readily obtained via elementary arguments.
Proof of Theorem 5.8. Fix , such that . For simplicity of notation, we will omit the dependence of the initial condition and write in place of , whenever doing so does not cause any confusion. The proof of the theorem is divided into two parts, depending on the value of . For the first part of the proof, we will assume that .
Let be the potential function defined in Eq. (44). We will first partition into three disjoint subsets, depending on the value of . Let
[TABLE]
and define
[TABLE]
We observe some useful properties of the fluid solutions.
Lemma 5.9
Fix so that all coordinates of are differentiable. The following holds:
1. If , then
[TABLE]
Furthermore, there exists a constant , such that if , then
[TABLE]
2. Denote by the coordinates over which is at least :
[TABLE]
There exists a constant , such that if , then for all , ,
[TABLE]
3. If , then
[TABLE]
Proof. Suppose that . We have that
[TABLE]
where step follows from the assumption that . In a similar fashion, now suppose that and hence . We obtain that
[TABLE]
This proves the first claim of the lemma.
For the second claim, fix and , . We have that
[TABLE]
where step follows from the assumption that , and from the fact that for all , and . This proves the second claim.
Finally, for the last claim, note that since , the fact that implies that at least one component of is no less than , and hence . We have that
[TABLE]
where the strict inequality in step follows from the definition of , the fact that is non-empty, and that for all . This completes the proof of Lemma 5.9.
We proceed by considering two cases for the initial condition of .
Case 1: . Claim 3 in Lemma 5.9 implies that
[TABLE]
By Claim 1 of the same lemma, we hence conclude that is non-decreasing in for all . By the same claim, we observe that is bounded from below by the constant whenever lies in . Since , we further conclude that in order for Eq. (56) to be valid, the amount of time spends in must satisfy:
[TABLE]
which, combined with Eq. (56), implies that for all :
[TABLE]
Claim 2 of Lemma 5.9 states that if , then any component of in will have a negative drift that is bounded away from zero. Fix , and suppose there exists . Since for all and , by Eq. (58), we have that
[TABLE]
Since the components of must be non-negative, the above equation implies that for all ,
[TABLE]
where
Eq. (60) will let us strengthen our previous observation. For all , we have that
[TABLE]
Since is non-decreasing in by the first claim of Lemma 5.9, we conclude that is non-decreasing in after . Combined with Eq. (57), this implies that
[TABLE]
where . Fix . Observe that by the definition of , we have that whenever . We have that
[TABLE]
and hence
[TABLE]
where step follows from the fact that . For step , note that since and , we have that
[TABLE]
The inequality in step thus follows from the fact that , for all , where we let .
Recall that, for , the solution to the ordinary differential equation is given by
[TABLE]
From Eq. (64), we conclude that for all ,
[TABLE]
where the last inequality follows from the fact that and for all , and hence . Let , and . Since is non-decreasing by Lemma 5.9, from Eqs. (62) and (67), we have that if , then
[TABLE]
We now show the convergence of for . Fix . Define
[TABLE]
and recall from Theorem 5.3 that , . By Eq. (60), we have that for all . Therefore, for all ,
[TABLE]
Using the Taylor expansion of the function around and noting that (Eq. (68)), we have that there exists , such that for all ,
[TABLE]
where , and we use the notation to mean . Given a fixed value of , Eq. (71) implies that the value of must lie between the solutions to the ODEs and , with initial condition , respectively. It can be verified that the solution to the ODE , with initial condition , is given by . Setting and , we thus conclude that there exist , such that
[TABLE]
Note that is always bounded. We thus conclude that there exists such that,
[TABLE]
Since the above equation holds for all , together with Eq. (68), we have proven Eq. (43) for the first case, by assuming that .
Case 2: . We now consider the second case where the initial condition, , belongs to the set . Recall from Eq. (50) of Lemma 5.9 that when , all components of in exhibit a negative drift of magnitude at least . Letting , we thus conclude that one of the following scenarios must occur:
1. for some .
2. for all , and there exists such that
[TABLE]
The first scenario is equivalent to “re-initializing” the system at time at a point in . The proof for Case 1 presented earlier thus applies and Eq. (43) follows. In what follows, we will focus on the second scenario. Without loss of generality, it suffices to show the validity of Eq. (43) by fixing an initial condition:
[TABLE]
Note that, by the definition of , this implies that
[TABLE]
We next show that for all . Fix such that
[TABLE]
Such exists because the set is open and is continuous. From Lemma 5.9, we know that if , then any component in must have a negative drift. It thus follows that
[TABLE]
We have that, whenever ,
[TABLE]
where step follows from the fact that for all , and from for all . In light of Eq. (66), the above inequality implies that if the value of ever falls below , then from that point on it will be bounded from below by an exponential function that is strictly positive. This shows that for all .
We now prove the exponential rate of convergence of to . Using again Eq. (79), we obtain
[TABLE]
where step follows from the fact that , and from for all . Using the reasoning identical to that in Eqs. (64) through (68), we conclude that there exists , such that , for all
Finally, the fact that implies that , and hence for all . By Eq. (78), this further implies that for any , for all . The exponential rate of convergence of for thus follows from the same argument as in Eqs. (70) through (73). This completes the proof of Theorem 5.8, assuming that .
We now turn to the case where . We will use a similar proof strategy by partitioning the state space based on the value of , albeit with a different partitioning. In particular, define the sets
[TABLE]
Note that by definition for all , so the above sets constitute a partition of . We have the following lemma.
Lemma 5.10
Fix so that all coordinates of are differentiable. If , then there exists a constant , such that .
Proof. Suppose that . Since , we have that . We have that
[TABLE]
where step follows from the fact that , and from . Finally, note that since , is strictly positive. This completes the proof of the lemma.
Lemma 5.10 implies that if , then we must have that
[TABLE]
Since the rate of change in every coordinate of is bounded, in light of the above equation, it suffices to establish Eq. (43) by considering only the scenario where for all . Fixing , we have that
[TABLE]
where step follows from the fact that whenever , from the fact that , and from the definition . In light of Eq. (66), Eq. (84) implies that
[TABLE]
Since the above equation holds for all , we have thus established an exponential rate of convergence of to , as . This completes the proof of Theorem 5.8.
5.4 Convergence of Steady-State Distributions
We complete the proof of Theorem 5.4 in this subsection. We will begin by establishing a useful distributional upper bound on . Fix , and denote by the number of jobs in the system at time in an M/M/∞ queue with arrival rate and departure rate , and by its steady-state distribution. It is well known that is a Poisson random variable with mean , and it follows that, for all ,
[TABLE]
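The Poisson steady state quoted above can be checked against the balance equations of the underlying birth-death chain, assuming the queue is of the infinite-server type, so that each job departs independently and the total departure rate in state n is n times the per-job rate. The sketch below uses hypothetical names (`arrival`, `departure`) and evaluates the residuals of arrival * pi(n) = (n + 1) * departure * pi(n + 1) at pi = Poisson(arrival / departure).

```python
import math

def poisson_pmf(n, mean):
    """Probability mass of a Poisson(mean) random variable at n."""
    return math.exp(-mean) * mean ** n / math.factorial(n)

def balance_residuals(arrival, departure, n_max=30):
    """Residuals of the birth-death balance equations
        arrival * pi(n) = (n + 1) * departure * pi(n + 1)
    for an infinite-server queue, with pi = Poisson(arrival / departure).
    """
    mean = arrival / departure
    return [
        arrival * poisson_pmf(n, mean)
        - (n + 1) * departure * poisson_pmf(n + 1, mean)
        for n in range(n_max)
    ]
```

All residuals vanish up to floating-point error, which is the detailed-balance argument behind the Poisson form of the steady state.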
Define the set
[TABLE]
and the process
[TABLE]
We have the following lemma, whose proof is given in Appendix D.2.
Lemma 5.11
The process is positive recurrent, and , for all .
Denote by the probability distribution over (Eq. (87)) that corresponds to the steady-state distribution of . Lemma 5.11 can be used to show that the sequence is tight, as formalized in the following result, whose proof is given in Appendix D.3.
Lemma 5.12
For every , define . Then, for every , there exists such that
[TABLE]
We say that is a limit point of if there exists a sub-sequence of that converges to . It is not difficult to verify that the space is separable as a result of both and being separable. By Prohorov’s theorem, the tightness of thus implies that any sub-sequence of admits a limit point with respect to the topology of weak convergence. Let be a sub-sequence of , and a limit point of the sub-sequence. Denote by and the marginals of and over the first coordinates (corresponding to ), respectively. In the remainder of the proof, we will show that necessarily concentrates on .
We first show that is a stationary measure with respect to the deterministic fluid solution. To this end, we will use the method of continuous test functions (cf. Section 4 of Ethier and Kurtz, (2005)). Let be the space of all bounded continuous functions from to . We will demonstrate that, for all ,
[TABLE]
where, from here onward, we will use the subscript, , to indicate the distribution of .
Define . Fix . Let be distributed according to , and define
[TABLE]
Fix . We have that
[TABLE]
Step follows from the triangle inequality, and from the fact that is stationary when initialized according to , and therefore,
[TABLE]
For step , we note that since depends continuously on (Lemma 5.2), the function , belongs to . Therefore, the step follows from the fact that converges weakly to as .
Fix . There exists such that the right-hand side of Eq. (93) satisfies
[TABLE]
where follows from the tightness property of Eq. (89), and step involves an argument that allows for interchanging the order of taking the limit and integration. We isolate step in the following lemma; the proof leverages Theorem 5.5 and is given in Appendix D.4.
Lemma 5.13
Fix a compact set and . We have that
[TABLE]
Fix . We have that
[TABLE]
where is the modulus of continuity of in : . Step follows from the fact that , and from Eq. (226). Because is compact, is uniformly continuous in and hence . The lemma follows by taking the limit as in Eq. (100).
Since Eq. (98) holds for all , combining it with Eq. (93), we conclude that
[TABLE]
and this proves Eq. (90). We now show that Eq. (90) implies that , i.e., is the unique invariant measure with respect to the dynamics induced by the fluid solution. Define the truncated norm:
[TABLE]
and the set
[TABLE]
Let be a random vector in distributed according to . Suppose, for the sake of contradiction, that for some , where the expectation is taken with respect to the randomness in . Fix such that . By Theorem 5.8, there exists such that . Therefore,
[TABLE]
where the first inequality uses the fact that for all . Because was assumed to be strictly positive, this is in contradiction with the stationarity of with respect to the fluid solution, which would imply that for all . We thus conclude that
[TABLE]
which in turn implies that . This proves Eq. (33).
Finally, to show Eq. (34), note that by Lemma 5.11, for all and , the random variable is non-negative and stochastically dominated by . Using similar steps as those in Eq. (225), it is not difficult to show that , which implies the uniform integrability of . In combination with Eq. (33), this proves the convergence in stated in Eq. (34). This completes the proof of Theorem 5.4 as well as the first claim of Theorem 2.2.
6 The Memory-Deficient Regime
We now prove the second claim of Theorem 2.2 concerning the memory-deficient regime, where , as . The statement is repeated below for easy reference.
Theorem 6.1
Suppose that , as . Then,
[TABLE]
where .
Fix . Let be the th update point, and set . Denote by
[TABLE]
the embedded discrete-time process associated with the choice process . The discrete-time processes and are defined analogously.
The key to proving Theorem 6.1 hinges upon the following property: the discrete-time choice process, , becomes asymptotically Markovian in the limit as tends to infinity. Note that is not Markovian for a finite , because the choices are sampled from a distribution that is a function of the current recallable reward vector, which in turn depends on past choices. However, the dependence of the reward vector on past choices is greatly weakened in the memory-deficient regime: because the life span of a unit reward is so small compared to the time between two adjacent choices, by the time a new choice is to be made, the agent will have essentially forgotten all of the past rewards associated with any action other than the one she currently uses. Therefore, when is large, knowing the choice for some , we can predict with high accuracy the recallable reward vector at the update point, rendering the choice process approximately Markovian. The approximate Markovian property will allow us to explicitly characterize the steady-state distribution of in the limit as , which we then leverage to analyze the steady-state distribution of the recallable reward process, . Following this line of thinking, our proof will consist of three main parts:
1. (Proposition 6.3) Fix . We first show that, in the limit as , the value of the scaled recallable reward process at the -th update point converges in probability to a deterministic vector that only depends on .
2. (Proposition 6.4) Using the convergence of the recallable reward vector, we derive an explicit expression for the steady-state distribution of the discrete choice process, , in the limit as .
3. Using the stationarity of the original recallable reward process, we combine the above two results to arrive at a characterization of the steady-state distribution of , thus completing the proof of Theorem 6.1.
We begin by stating some basic properties of and , which will allow us to relate the steady-state behavior of these processes to their continuous-time counterparts; the proof of the following lemma is given in Appendix D.5.
Lemma 6.2
Fix . The discrete-time process is positive recurrent, and its steady-state distribution satisfies
[TABLE]
and
[TABLE]
For the remainder of the proof, we will fix to be an increasing function, such that , and, in particular,
[TABLE]
Define the scaled discrete-time recallable reward process: , . The following proposition states that when is large, the value of converges to a deterministic value that depends solely on : is equal to for , and zero, otherwise. In other words, the agent will have forgotten all past rewards other than those associated with her current choice. The proof is given in Appendix C.3.
Proposition 6.3
Define the matrix , where
[TABLE]
Suppose that the continuous-time process is initialized at according to its steady-state distribution. Then, for all ,
[TABLE]
and
[TABLE]
Using Proposition 6.3, we will be able to obtain an explicit expression for the limit , as is stated in the following proposition.
Proposition 6.4
Define . We have that
[TABLE]
where is a normalizing constant.
Proof. Suppose that is initialized at in its steady-state distribution. We have that, for all ,
[TABLE]
where . By Eq. (112) in Proposition 6.3, Eq. (115) implies that there exists a sequence of vectors with , such that for all and ,
[TABLE]
where . Note that by the stationarity of , we have that . Therefore, by treating as a column vector, and defining the matrix, , where
[TABLE]
Eq. (116) can be written more compactly as
[TABLE]
It is not difficult to verify that is row-stochastic and corresponds to the transition kernel of a discrete-time, irreducible Markov chain over . Therefore, the equation admits a unique solution, , in the -dimensional simplex. Furthermore, it is not difficult to check that the following choice of satisfies , and is hence the unique solution:
[TABLE]
Since , combining Eqs. (118) and (119), we conclude that converges to coordinate-wise, as . This completes the proof of Proposition 6.4.
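The fixed-point argument above (a row-stochastic, irreducible kernel admits a unique stationary vector in the simplex) can be reproduced numerically. The entries of the matrix are not visible in this text, so the kernel below is a hypothetical stand-in: it assumes that, when action i is the current choice, only action i has remembered rewards, and that the next choice is drawn with weight max(lam[i], alpha) for action i and alpha for every other action. The names `lam` and `alpha` are likewise assumptions.

```python
import numpy as np

def limiting_kernel(lam, alpha):
    """Row-stochastic matrix for an ASSUMED limiting choice chain:
    from state i, action k gets weight max(lam[i], alpha) if k == i
    and alpha otherwise (alpha > 0 keeps the chain irreducible)."""
    lam = np.asarray(lam, dtype=float)
    K = len(lam)
    W = np.full((K, K), float(alpha))
    W[np.arange(K), np.arange(K)] = np.maximum(lam, alpha)
    return W / W.sum(axis=1, keepdims=True)

def stationary(P):
    """Solve pi P = pi together with sum(pi) = 1 by least squares."""
    K = P.shape[0]
    A = np.vstack([P.T - np.eye(K), np.ones((1, K))])
    b = np.zeros(K + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi
```

For this assumed kernel, a direct calculation shows that the stationary vector is proportional to max(lam[k], alpha) + (K - 1) * alpha, which is close to proportional to the reward rates when alpha is small.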
We are now ready to complete the proof of Theorem 6.1. Fix . Suppose that we initiate the continuous-time process in its steady-state distribution so that it is stationary. The stationarity implies that for all ,
[TABLE]
where is as defined in Proposition 6.4, and step follows from the fact that (Lemma 6.2). Taking the limit as , we have that
[TABLE]
where step follows from Eq. (113) in Proposition 6.3 and Eq. (114) in Proposition 6.4. This completes the proof of Theorem 6.1.
7 Proof of Theorem 2.1
We prove Theorem 2.1 in this section. Let be the discrete-time embedded Markov chain defined in Eq. (107). By Lemma 6.2, and admit the same distribution, and it hence suffices to characterize the former. Because the limiting expressions for in the memory-deficient regime have already been established in Proposition 6.4, we will focus on the memory-abundant regime, where . By Theorem 5.4, for all
[TABLE]
where the first step follows from Eq. (109) in Lemma 6.2. Fix . We have that
[TABLE]
where the last equality follows from Eq. (121) and the fact that is uniformly bounded over . Eq. (122), along with the expression for in Theorem 5.3, leads to Eqs. (5) and (6), depending on the value of and . This completes the proof of Theorem 2.1.
8 Extensions and Generalizations
We examine in this section a number of extensions to the original model. We will focus on key ideas with the intention of illustrating possible future directions, and the discussion will be more exploratory in nature, and less formal than that of our main results.
8.1 Fluid Solutions with Polynomial Choice Models
We consider in this sub-section a generalization of the reward-matching choice rule. Recall that under reward matching, the probability of choosing action when the recallable reward vector is is linearly proportional to . A natural generalization would be to consider a polynomially weighted reward-matching rule, where the probability of choosing action is proportional to , where is a fixed parameter.
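As a rough illustration of the polynomially weighted rule, the sketch below computes choice probabilities from a vector of recallable rewards. The names `beta` (the exponent) and `floor` (a stand-in for the exploration parameter, applied here as a lower bound on each reward before weighting) are assumptions made for this example and are not notation taken from the paper.

```python
def polynomial_choice_probs(rewards, beta, floor=0.0):
    """Choice probabilities proportional to max(reward, floor) ** beta.

    Assumes at least one weight is strictly positive.
    """
    weights = [max(q, floor) ** beta for q in rewards]
    total = sum(weights)
    return [w / total for w in weights]


# beta = 1 recovers plain reward matching; a smaller beta flattens the
# distribution and a larger beta sharpens it.
LINEAR = polynomial_choice_probs([1.0, 3.0], beta=1.0)
FLAT = polynomial_choice_probs([1.0, 3.0], beta=0.0)
SHARP = polynomial_choice_probs([1.0, 3.0], beta=2.0)
```

With an exponent of zero every action receives the same weight, which matches the intuition that the choice distribution becomes uniform as the exponent decreases.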
An interesting question is how the agent’s behavior will depend on the value of , and we will provide some preliminary analysis for this question in this sub-section. We will focus our attention on the invariant state(s) of the corresponding fluid solutions for the recallable rewards, as it serves as a good proxy for the stationary distribution of the pre-limit process in the memory-abundant regime, and from it the stationary choice distributions may also be inferred. The memory-deficient regime is in fact much easier to analyze, and we will postpone the discussion of that regime till the end of this sub-section (Section 8.1.1).
In the fluid limit, the polynomial reward matching rule would lead to a modified function, (originally defined in Eq. (14)), given by
[TABLE]
It turns out that the dynamics of the fluid model will highly depend on whether or not, and we will divide our analysis into two cases as such.
Case 1: . The next proposition shows that when the fluid solutions admit a unique invariant state, and gives a characterization of its form. (Our original model already covers the case of , and we shall therefore focus on the case where .) The proof of Proposition 8.1 is given in Appendix C.4.
Proposition 8.1
Fix and . The fluid solutions admit a unique invariant state, , whose expression depends on the value of , as follows.
1. Suppose that . Then , for all .
2. Suppose that . Then
[TABLE]
3. Suppose that . There exists , such that for , and for . In particular, define the function
[TABLE]
Then, is the unique index that satisfies
[TABLE]
Furthermore, is the unique solution to the following equation in
[TABLE]
and the remaining coordinates of are given by:
[TABLE]
We can draw a few interesting observations from Proposition 8.1. The result suggests that the equilibrium behavior of the agent is largely consistent with the original model and our intuition: as decreases towards [math], the agent’s choice distribution tends to become uniform over the actions, since when is small, the agent’s choice probabilities depend weakly on the rewards of the actions. This is illustrated numerically in Figure 7-(a) via an example with two actions. The uniqueness also suggests that the stochastic approximation portion of our results (Theorem 2.2) should carry over to the regime of using the same arguments, without significant difficulty. Overall, these observations show that the insights from our original model should be fairly robust under small perturbation of in the direction towards [math].
Proposition 8.1 also reveals interesting new insights that are less obvious. Firstly, Item 1 of the proposition shows that the phenomenon of “complete oblivion” observed in the original model (see Theorem 2.2 and Section 2.1), where the agent chooses uniformly across actions as a result of a large exploration parameter , is not an artifact of being exactly equal to , and in fact continues to exist whenever . Moreover, the threshold on that marks the beginning of such oblivion is the same as before, and does not depend on .
Secondly, Item 2 of the proposition shows a regime not previously observed, and with a surprisingly clean expression: when is very small, the probability of choosing action in the invariant state is exactly proportional to , and this expression, remarkably, does not depend on (to be contrasted with Eq. (5) which does depend on ). This result would appear to lead to a contradiction with Theorem 2.2: by increasing towards , Eq. (124) suggests that the agent will eventually concentrate % of her efforts on the best action and the degree of such concentration is independent of . In other words, how is it possible that the agent may choose the optimal action more frequently when than when ? To resolve this puzzle, we note that the regime in Item 2 requires , and the right-hand side, which is also equal to the invariant state of the worst action, , tends to [math] as . Therefore, as approaches from below, the regime in Item 2 remains valid only if vanishes accordingly, thus resolving the paradox.
There is another important consequence of Item 2. Suppose that is fixed at a sufficiently small value, say, . Then, Item 2 becomes the only valid regime for all sufficiently small . As such, Eq. (124) provides a precise, quantitative description as to how the agent’s choice probabilities become uniform, as , namely, that the probability of choosing action will be equal to for all sufficiently small .
Finally, Item 3 of the proposition addresses the remaining scenarios, where is neither too small nor too big. Unfortunately, we no longer have a closed-form expression for this intermediate regime. Nevertheless, the invariant state can still be evaluated without much difficulty, and all coordinates of can be derived in closed-form as a function of and .
Case 2: . We now look at the second case, where . The dynamics of the fluid solutions turn out to be more complex than when , and we do not yet have a good understanding. To illustrate the complexities that one would encounter in this regime, we next show a simple example demonstrating that the fluid solutions could admit multiple invariant states.
Consider a setting with two actions (), where and . The drifts of the fluid solutions for this example are visualized in Figure 7-(b). The fluid solutions admit at least three distinct invariant states, marked by the red dots in Figure 7-(b). First, one can verify that the vector , where and constitutes an invariant state (labeled in Figure 7-(b)). However, is not the unique invariant state, nor is it stable: a perturbation along the direction of or , where is a small constant, can induce the fluid solution to move towards one of the other two (stable) invariant states, (labeled in Figure 7-(b)) and (labeled in Figure 7-(b)), respectively. Here, satisfies , , and . Analogous to , the state satisfies , and .
We offer some speculation as to why the multiplicity of invariant states emerges when . In this case, the agent leans disproportionately more towards actions with higher rewards, which means that sufficiently substantial advantages in the initial recallable rewards could lead an action to permanently dominate another action , even if . This logic suggests that, in principle, if the reward rates do not vary significantly across actions, then any measure that concentrates on a single action could be a stable invariant state for the choice process, so long as the fluid solution starts in a state where the action has a substantial amount of initial rewards compared to other actions; this seems to explain the two stable invariant states in Figure 7-(b), and . The unstable invariant state , on the other hand, seems to stem from a different type of dynamics: it is a point where the action with the superior reward rate (in this case, ) has fewer recallable rewards (), and the disadvantage in recallable rewards is exactly neutralized by the advantage in reward rate. The point is unstable because small perturbations could easily break this balance, inducing the advantage in the reward rate to dominate that of recallable rewards or vice versa.
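The dependence on initial conditions can be explored with a small numerical experiment. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: the decay rate is normalized to one, the drift of each coordinate is taken to be its reward rate times its choice probability minus its current value, the exponent `beta` is set to 2, the exploration parameter is modeled as a small lower bound `floor`, and the reward rates in `RATES` are hypothetical. It uses a plain forward Euler scheme.

```python
def fluid_drift(q, rates, beta, floor):
    """Assumed drift: rate_k * p_k(q) - q_k, with p_k from polynomial weights."""
    weights = [max(x, floor) ** beta for x in q]
    total = sum(weights)
    return [lam * w / total - x for lam, w, x in zip(rates, weights, q)]


def integrate_fluid(q0, rates, beta, floor=0.01, dt=0.01, steps=5000):
    """Forward Euler integration of the assumed fluid dynamics."""
    q = list(q0)
    for _ in range(steps):
        drift = fluid_drift(q, rates, beta, floor)
        q = [x + dt * dx for x, dx in zip(q, drift)]
    return q


# Hypothetical reward rates: the second action is slightly better.
RATES = [1.0, 1.1]
# Same dynamics, two initial conditions favoring different actions.
END_A = integrate_fluid([1.0, 0.05], RATES, beta=2.0)
END_B = integrate_fluid([0.05, 1.0], RATES, beta=2.0)
```

Under these assumptions the two trajectories settle near different states: the first stays concentrated on the inferior action, while the second concentrates on the better one, in line with the lock-in intuition described above.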
What does this mean for the pre-limit, stochastic system? Note that while the fluid solutions can admit multiple invariant states when , the steady-state distribution for any pre-limit stochastic system () is always unique: the exponential decay of recallable rewards in the pre-limit system ensures that is positive recurrent, as it will return infinitely often to the state [math]. Nevertheless, the preceding analysis of the fluid solutions suggests that the stochastic system will likely experience a qualitative change as well once is greater than : when is large, instead of converging over time to the neighborhood of a single configuration, as is the case when , is likely to alternate between multiple different configurations, each corresponding to one of the multiple invariant states in the fluid solutions.
8.1.1 General Choice Models
At a higher level, the reward matching rule is a special case of a broader family of choice heuristics, in which action is chosen with a probability proportional to , where is a non-decreasing weight function. The polynomial reward-matching rule studied earlier corresponds to , for some . One may also consider for some constant , in which case the sampling procedure becomes reminiscent of the celebrated logit model in choice theory (cf. Chapter 3, Train, (2009)) as well as the exponential-weight algorithm in the multi-armed bandit literature (cf. Auer et al., (2002)).
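The family of weight-based heuristics can be written generically. In the sketch below, `choice_probs` takes an arbitrary weight function, and `exp_weight` is an assumed exponential form with a hypothetical constant `c`; with that form the rule coincides with the familiar softmax (logit) expression, where the odds between two actions depend only on the difference of their rewards.

```python
import math


def choice_probs(rewards, weight):
    """Choice probabilities proportional to weight(reward) for each action."""
    weights = [weight(q) for q in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def exp_weight(c):
    """Assumed exponential weight function q -> exp(c * q)."""
    return lambda q: math.exp(c * q)


# Linear weights give plain reward matching; exponential weights give a
# logit-style rule.
MATCHING = choice_probs([1.0, 3.0], lambda q: q)
LOGIT = choice_probs([1.0, 3.0], exp_weight(0.5))
```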
Can we extend our results beyond the polynomial reward-matching rule to include this even broader family of choice models? On the positive side, it is not difficult to check that the same analysis for the memory-deficient regime can be readily extended to these models: the limiting probability of choosing action would become proportional to , i.e., with in our original result being replaced by . Note that if we had the freedom of choosing , then applying a highly skewed weight function, e.g., , would induce the agent to almost always choose the best action in steady state, even in the memory-deficient regime. However, the downside is that with a skewed weight function, the agent is highly unlikely to give up the current choice at an update point, regardless of its reward rate. It will thus take a long time for the agent’s time-average behavior to approach that of the steady state, resulting in bad transient performance. A similar trade-off seems to be present in the context of the exploration parameters, , where a smaller tends to improve steady-state performance at the expense of a slower convergence to the steady-state dynamics. It would be interesting to obtain a more precise and rigorous understanding of the trade-off between the agent’s steady-state versus transient performance.
On the negative side, extensions to general choice models appear to be significantly more difficult in the memory-abundant regime, as is evidenced by the preceding analysis of the polynomial reward-matching model. Making progress on this front is likely to require more sophisticated analysis to cope with the complex dynamics that will arise in fluid solutions, such as when in the polynomial reward-matching rule.
8.2 Non-Exponential Lifespan Distributions
We have thus far assumed that the lifespans of rewards follow an exponential distribution, and it would be interesting to see whether the qualitative insights from our model would continue to hold under other, non-exponential lifespan distributions. In this subsection, we experiment with two other lifespan distributions: in the first case, we set the lifespan to be a constant, , and in the second case, the lifespan follows a (heavy-tailed) Pareto distribution with scale , shape , and mean . Figure 8 shows the steady-state distribution of choices under these two scenarios, with . Interestingly, comparing the results in Figure 8 to those from the original model (Figure 4-(b)), we see that the distribution of choices is largely insensitive to the choice of lifespan distribution, and the theoretical predictions made in Theorem 2.2 appear to hold even under these non-exponential distributions.
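For readers who wish to repeat such an experiment, the sketch below shows one way to draw lifespans from the two alternative distributions while matching a target mean. The parameter values are hypothetical and are not the ones used for Figure 8. It relies on the standard fact that a Pareto distribution with scale s and shape a > 1 has mean a s / (a - 1).

```python
import random


def pareto_mean(scale, shape):
    """Mean of a Pareto distribution; finite only when shape > 1."""
    if shape <= 1:
        raise ValueError("the mean is infinite when shape <= 1")
    return shape * scale / (shape - 1)


def pareto_scale_for_mean(mean, shape):
    """Scale parameter that yields the requested mean for a given shape."""
    return mean * (shape - 1) / shape


def pareto_lifespan(rng, scale, shape):
    # random.paretovariate samples a Pareto variable with scale 1,
    # so multiplying by `scale` gives the general case.
    return scale * rng.paretovariate(shape)


def constant_lifespan(mean):
    """Deterministic lifespan equal to the target mean."""
    return mean


# Hypothetical target mean and shape, for illustration only.
RNG = random.Random(0)
SCALE = pareto_scale_for_mean(2.0, 3.0)
SAMPLES = [pareto_lifespan(RNG, SCALE, 3.0) for _ in range(1000)]
```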
8.3 Heterogeneous Memory Decay Rates
As was alluded to in Example 2 in the Introduction, we may also consider the case where the memory decay rates are not constant across actions, so that the rewards associated with action depart at rate . This generalization is useful for modeling service systems where the rate of customer departure or attrition depends on the service type a customer is associated with.
Analogous to the case of uniform decay rate, the scaling of Section 1.2 corresponds to setting in the th system, where are positive constants. Under this scaling, it is not difficult to verify that both Theorems 2.1 and 2.2 extend to this more general case, if we replace every occurrence of with a decay-rate-weighted version, . In other words, as far as the limiting system is concerned, the version of the problem with heterogeneous memory decay rates is equivalent to one with uniform decay rates but appropriately weighted reward rates. We should note that this equivalence only holds in the limit as , and is not exact in a pre-limit system with a finite .
9 Discussion
We have proposed in this paper a stochastic action-reward model for studying the impact of imperfect memory in dynamic decision making. In the limiting regimes where the agent’s memory span is large, our main results provide exact expressions for the steady-state choice probabilities of an agent following the so-called reward-matching rule, and demonstrate that these probabilities are highly sensitive to the relative scaling between two parameters: the rate of memory decay, , and the rate at which the agent makes new choices, .
There are a number of potentially interesting extensions of our formulation. For instance, the current paper focuses on the regime where the rate of memory decay , and alternatively, one may also look at an opposite limiting regime where tends to infinity, i.e., the agent becomes extremely forgetful and her memory span approaches zero. Note that since the total amount of recallable rewards in system is on the order of (regardless of the value of ), in this limit we should see very few recallable rewards at any point in time and the process of recallable rewards frequently hitting the zero state. It is therefore reasonable to assume that the reward-matching rule will perform poorly when is very large, with the agent choosing all actions essentially uniformly at random. Unfortunately, we do not yet have the tools needed to characterize this regime, since the fluid approximations developed in this paper heavily rely on mean-field-type evolution of a large amount of recallable rewards and are unlikely to be applicable when is large.
The present reward-matching model is symmetric, in the sense that the agent does not have any inherent preferences of one action over another. One way to relax this assumption is to allow the exploration parameter to depend on the action, so that action will be sampled with probability . In the memory-deficient regime, it would not be difficult to show that this results in a steady-state choice distribution where action is chosen with probability proportional to . The memory-abundant regime will become more challenging to analyze, because the heterogeneity of would in general break the monotonicity of the invariant-state recallable rewards, since now an action with a higher reward rate may not necessarily have a higher recallable reward in the invariant state. It would be an interesting future direction to better understand how to incorporate inherent preference asymmetries into the formulation.
While the present paper focuses largely on the theory, there appear to be several interesting potential applications of our model. First, the reward matching rule can be viewed as a simplified cognitive model to capture humans’ unconscious reinforcement behavior, and it will be interesting to see if the theory could be used to make predictions on a consumer’s long-run choice behavior when switching among similar products or services (e.g., Example 1 of Section 1), or to investigate the effect of memory loss on a player’s learning and performance in repeated games similar to those studied by Erev and Roth, (1998). Second, memory loss can be interpreted in a more metaphorical sense, and be used to model the departure of customers (e.g., Example 2 of Section 1). Cast in this light, the reward matching policy may act as a simple and intuitive heuristic in a manager’s algorithmic toolbox for dynamically choosing product offerings or demographic targets for advertisements. Finally, the dynamics of the recallable rewards in our system resemble that of a queueing network. For instance, the departures of rewards for each individual action function similarly to service completions in an infinite-server queue. There has been a growing literature in recent years on queueing systems where customers make their own choices regarding abandonments or the types of services they would like to receive (cf. Hassin and Haviv, (2003), Pender et al., (2016), Ding et al., (2016), Dong et al., (2018)), and it would be interesting to see whether our model can be applied to the study of (strategic) choice behavior in these service systems.
While memory loss is regarded as a given constraint in the current work, one could also ask whether it would be beneficial to artificially induce memory loss even when perfect recall is feasible. For instance, it has been observed in the multi-arm bandit literature that policies with built-in regularization that penalizes distant past experience can achieve better regret with time-varying, or adversarially chosen, parameters (cf. Garivier and Moulines, (2008), Van Erven and Kotl, (2014), Besbes et al., (2015), Keskin and Zeevi, (2016)). This is because these policies tend to be more adaptive to the environment and not “weighed down” by past experience. While our theorems do not directly apply to time-varying reward rates, Theorem 5.8 shows that the fluid solution converges exponentially quickly to the invariant state from any bounded initial condition, suggesting that the reward matching rule could be an attractive algorithm for dynamic learning in non-stationary environments.
Finally, at a higher level, there appear to be other interesting ramifications of “memory loss” in organizations to be explored. For instance, departures of employees could lead to the loss of skills and expertise (cf. Benkard, (2000)), and such organizational forgetting could potentially affect a firm’s efficiency in a significant way.
10 Acknowledgment
The authors would like to thank the anonymous referees at Mathematics of Operations Research for their comments and input, and Professors Steven Callander, Dana Foarta, J. Michael Harrison, David M. Kreps, Mihalis Markakis, Michael Ostrovsky, Daniel Russo, Daniela Saban and Takuo Sugaya for the insightful discussions and feedback on the manuscript. Yun would like to acknowledge the support from the National Research Foundation of Korea (NRF) grant No. 2019028324, funded by the government of Korea (MSIT).
Appendix A Comparison to Perfect Memory
We discuss in this appendix what could happen if there were no memory decay in our model, i.e., if , and why it is different from the limiting regime considered in this paper, with .
When , all recallable rewards remain in the system indefinitely. If we view the recallable rewards sampled at the update points as a discrete-time process, and in addition set the exploration parameter to 0, then our model becomes essentially the same as the choice process analyzed by Beggs, (2005), where it is shown that the probability of choosing the best action converges to one as under Luce’s rule, as long as each action is associated with some strictly positive initial rewards. If is a positive constant, because there is no memory decay, the effect of disappears as soon as the rewards for all actions exceed , and the same conclusion should hold.
Therefore, one would expect that when and , the choice probability under the reward-matching rule will concentrate on the best action as , regardless of the update rate, . This is however different from the conclusion of Theorem 2.1, which shows the existence of two distinct limiting steady-state probabilities, one of which does not exhibit concentration on the best action. These observations thus suggest that our scaling regime does capture unique effects of imperfect memory. This is perhaps not too surprising in hindsight: if we had set to zero, any positive update rate would become, by definition, significantly greater than , and hence the second (memory-deficient) regime in Theorems 2.1 and 2.2 could not have appeared when .
Appendix B Technical Preliminaries
Proposition B.1 (Doob’s inequality)
(cf. Section 12.6, Grimmett and Stirzaker, (2001)) Let be a discrete- or continuous-time non-negative submartingale. Fix . We have that
[TABLE]
Proposition B.2 (Gronwall’s lemma)
(cf. Section 1.3, Ames and Pachpatte, (1997)) Let be continuous and non-negative functions, and let be a continuous, positive and non-decreasing function. If , for all then
[TABLE]
Appendix C Proofs of Propositions
C.1 Proof of Proposition 5.6
The proof of Proposition 5.6 is largely based on maximal inequalities for continuous-time martingales, and the main steps follow the approach developed in Kurtz, (1978). Define , where
[TABLE]
The proof proceeds in two stages: we first show, via Gronwall’s lemma (Proposition B.2 in Appendix B), that bounding the deviation of from can be reduced to bounding the magnitude of . We then show that this can be accomplished by writing as a sum of two continuous-time martingales. From the definition of , we have that
[TABLE]
For step , we observe that for all , is a -Lipschitz continuous function. We now apply Gronwall’s lemma (Proposition B.2 in Appendix B) to Eq. (135), with , , and , and obtain that
[TABLE]
Because we have assumed that for all , to prove Proposition 5.6, it therefore suffices to show that
[TABLE]
We now prove Eq. (137) by expressing as the sum of two continuous-time martingales corresponding to the arrivals and departures of rewards, respectively. Fix and , and denote by and the amount of rewards associated with action that have arrived and departed, respectively, during the interval . In particular, we can write
[TABLE]
We have that
[TABLE]
For the remainder of the proof, we will focus on showing that, for all ,
[TABLE]
In light of Eq. (140), the above two equations imply the validity of Eq. (137), which proves Proposition 5.6. The proofs for both Eqs. (141) and (142) hinge upon the following technical result, which states that the sample path of a certain time-inhomogeneous Poisson process stays close to its mean, with high probability. Note that a similar result for uniform-rate Poisson processes can be found in Lemma 7.6 of Massey and Whitt, (1998). The proof of the lemma is given in Appendix D.6, and uses a similar line of argument as that of Theorem 2.2 in Kurtz, (1978).
Lemma C.1
Fix and . Let be the counting process where is the number of times in that the process jumps from state to , for some . Denote by the corresponding rate function of , so that the instantaneous transition rate of at time is equal to . For all , we have that
[TABLE]
where .
We now prove Eqs. (141) and (142). In the context of Lemma C.1, is the counting process with , where is the vector whose th coordinate is and all other coordinates zero, corresponding to an arrival to . The rate of at time is equal to , which is bounded from above by for all . By applying Lemma C.1, with and , we have that , and for all ,
[TABLE]
The proof of Eq. (142) uses essentially the same idea, but the argument needs to be more delicate due to the fact that the rate of the counting process , which is equal to , is not bounded over the state space. Therefore, we first derive an upper bound on the tail probabilities of , as follows. Let be the rate function for as in Lemma C.1, and . Fixing , we have that, for all ,
[TABLE]
where step follows from the fact that is independent of , and hence the maximum in the second term is attained by setting . Step follows from for all , and from the inequality in Eq. (143), by replacing with .
We are now ready to establish Eq. (142):
[TABLE]
where step follows from Lemma C.1, and from Eq. (146). Note that and are positive constants, and by our assumption, for all . Therefore, Eq. (142) follows by taking the limit in the above inequality as . This completes the proof of Proposition 5.6.
C.2 Proof of Proposition 5.7
For the purpose of this proof, we will use an alternative representation of the fluid solutions using integral equations. Define the drift function, :
[TABLE]
where is defined in Eq. (14). It can be verified from the definition of that there exists such that is -Lipschitz continuous for all . Fix , let , , be a solution to the following integral equation:
[TABLE]
Similar to Lemma 5.2, it is not difficult to show that the function defined in Eq. (149) exists, is unique, and coincides with the fluid solution with initial condition . We have that
[TABLE]
where the last inequalities come from the fact that is -Lipschitz continuous for all . In a manner analogous to Eq. (135) from the proof of Proposition 5.6, by applying Gronwall’s lemma (Proposition B.2 in Appendix B), we have that, for all ,
[TABLE]
Therefore, in order to establish Proposition 5.7, it suffices to show that
[TABLE]
In what follows, we will show Eq. (154) by using the discrete-time embedded process of and analyzing the system dynamics at the times when new choices are made. We begin by introducing some notation. Fixing , we denote by the th update point, i.e., time of the th update in , with , and
[TABLE]
Note that are independent exponential random variables with mean . Let be the discrete-time process, where corresponds to the value of immediately following the th update point:
[TABLE]
and be the process of indicator variables:
[TABLE]
That is, if action is selected on the th update point. Finally, let be a right-continuous piece-wise constant process which coincides with at the points :
[TABLE]
Fix , and denote by the number of updates in by time , i.e.,
[TABLE]
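The update points and the associated counting variable can be simulated directly. The following sketch assumes a hypothetical update rate `rate` (so that the gaps between consecutive updates are independent exponential variables with mean `1 / rate`) and a hypothetical time horizon; the number of points generated up to the horizon is then Poisson distributed with mean `rate * horizon`.

```python
import random


def update_points(rng, rate, horizon):
    """Update times in [0, horizon] with i.i.d. exponential inter-update gaps."""
    points = []
    t = rng.expovariate(rate)  # expovariate(rate) has mean 1 / rate
    while t <= horizon:
        points.append(t)
        t += rng.expovariate(rate)
    return points


# Hypothetical values, for illustration only.
RNG = random.Random(1)
POINTS = update_points(RNG, rate=5.0, horizon=10.0)
NUM_UPDATES = len(POINTS)  # number of updates by the horizon
```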
By the triangle inequality, we have that
[TABLE]
We now derive upper bounds on tail probabilities for each term on the right-hand side of Eq. (161). For the first term, define
[TABLE]
Fix and . Let
[TABLE]
(To avoid the excessive use of floors and ceilings, we assume that is a positive integer. The results extend easily to the general case.) Define the events
[TABLE]
We have that
[TABLE]
With the above equation in mind, we now proceed to demonstrate that both and converge to zero in the limit as . Note that is a Poisson random variable with mean . Using elementary tail bounds on the Poisson distribution, we have that
[TABLE]
which converges to zero as .
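To give a sense of how such tail bounds behave, the sketch below compares the exact upper tail of a Poisson variable with one standard Chernoff-type bound, namely that a Poisson variable with mean m exceeds m + x with probability at most exp(-x^2 / (2 (m + x))). This is offered only as an illustration of an elementary bound; it is an assumption of this example and may differ from the specific inequality used in Eq. (166).

```python
import math


def poisson_upper_tail(mean, k):
    """Exact P(N >= k) for N ~ Poisson(mean) and an integer k >= 0."""
    pmf = math.exp(-mean)  # P(N = 0)
    cdf_below = 0.0
    for i in range(k):
        cdf_below += pmf        # accumulate P(N = i)
        pmf *= mean / (i + 1)   # advance to P(N = i + 1)
    return max(0.0, 1.0 - cdf_below)


def poisson_tail_bound(mean, x):
    """Chernoff-type bound on P(N >= mean + x) for x > 0."""
    return math.exp(-x * x / (2.0 * (mean + x)))


# Example: mean 10, deviation 10.
EXACT = poisson_upper_tail(10.0, 20)
BOUND = poisson_tail_bound(10.0, 10.0)
```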
We next turn to the value of . Recall from the definition of that
[TABLE]
It is therefore not difficult to verify that is a martingale, and our objective would be to derive an upper bound on its maximum upward excursion over . Unfortunately, we cannot apply the Azuma-Hoeffding inequality, because the th increment of involves the term , which does not admit a bounded support. Instead, we will use the following upper bound on the moment generating function of , whose proof is based on Doob’s inequality and is given in Appendix D.7.
Lemma C.2
Fix and . We have that
[TABLE]
We are now ready to establish an upper bound on the quantity . Recall that . Fix , and let In particular, . We have that
[TABLE]
Step follows from Doob’s inequality and the fact that, for any , the sequence is a submartingale, and from Lemma C.2 with . Step is due to , and hence . Finally, step follows from the definition of and that .
With a derivation analogous to that of Eq. (172), we have that
[TABLE]
Therefore, combining Eq. (165) with Eqs. (166), (172) and (173), we have that
[TABLE]
Since as , the above equation further implies that
[TABLE]
We now bound the second term in Eq. (161). It is not difficult to verify, from Eq. (14), that there exists such that is -Lipschitz continuous for all . We have that
[TABLE]
It therefore suffices to show that, if as , then
[TABLE]
To this end, we have that
[TABLE]
In step we have invoked the property that is piece-wise constant. Step follows from the definition of in Eq. (37), and from the fact that for all (Eq. (40)). It remains to derive an upper bound on the tail probabilities of the term , which is isolated in the form of the following lemma. The proof involves an elementary application of Markov’s inequality, and is given in Appendix D.8.
Lemma C.3
Suppose that as . We have that
[TABLE]
We are now ready to prove Eq. (177). By Eq. (178), we have that
[TABLE]
for all , where the first inequality follows from a union bound, and the last step from Proposition 5.6 and Lemma C.3. Substituting Eqs. (175) and (180) into Eq. (161), we have that
[TABLE]
This establishes Eq. (154), which, in light of Eq. (153) completes the proof of Proposition 5.7.
C.3 Proof of Proposition 6.3
*Proof.* We begin by showing the following, strengthened version of Lemma 5.11, which states that a similar stochastic dominance property holds for even when conditioning on being of a specific value.
Lemma C.4
Let be a random vector drawn from the steady-state distribution, . There exist constants , and , such that for all ,
[TABLE]
*Proof.* Let . We have that, for all , ,
[TABLE]
By Eq. (109) of Lemma 6.2, the above equation implies that
[TABLE]
where the last step follows from the fact that is always bounded from above by .
By Eq. (86), we have that almost surely as . Therefore, there exist and , such that
[TABLE]
Combining Eq. (108) in Lemma 6.2 with Eqs. (184) and (185), we have that, for all ,
[TABLE]
where . Step follows from Eq. (184), from Lemma 5.11 and a union bound, and from Eq. (185).
Fix . We have that, for all ,
[TABLE]
for all , where step follows from Lemma 5.11, and from Eq. (186). Since the above inequality holds for all , this completes the proof of Lemma C.4.
We now prove the convergence in Eq. (112). Recall that the first update point, , is exponentially distributed with mean , and independent from . Define the event . We have that
[TABLE]
Fix . Recall from Eq. (86) that converges to almost surely as . This implies that there exists , independent of , such that
[TABLE]
where step follows from Lemma C.4.
Fix . We have that
[TABLE]
where step follows from the independence between and , and from Eq. (188).
We now bound the term on the right-hand side of Eq. (190). Denote by a binomial random variable with trials and a success probability of per trial, and by the number of jobs in system at time in an initially empty queue with arrival rate and departure rate . We have that
[TABLE]
For step , note that each unit of reward initially present at time has probability of remaining in the system by . Therefore, the rewards in site at time satisfy the following decomposition:
[TABLE]
The first term corresponds to the units of rewards at that had arrived during the interval , and hence is non-zero only if . The second term corresponds to those individuals initially present at who remained in the system by . Step follows from the definition of , and from the well-known fact that the number of jobs in system in an initially empty queue at any time is always stochastically dominated by its steady-state distribution.
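The two pieces of this decomposition have simple expected values when lifespans are exponential and the queue is of the infinite-server type, which is the assumption made in the sketch below. The names `arrival_rate`, `decay_rate`, and `initial` are hypothetical placeholders for the quantities in the text. A unit present at time zero survives to time t with probability exp(-decay_rate * t), and an initially empty infinite-server queue has a Poisson number of jobs at time t whose mean never exceeds the steady-state mean.

```python
import math


def survival_probability(decay_rate, t):
    """Probability that a unit with an exponential lifespan survives to time t."""
    return math.exp(-decay_rate * t)


def transient_mean_from_empty(arrival_rate, decay_rate, t):
    """Mean number in an initially empty infinite-server queue at time t."""
    return (arrival_rate / decay_rate) * (1.0 - math.exp(-decay_rate * t))


def steady_state_mean(arrival_rate, decay_rate):
    """Mean of the steady-state (Poisson) distribution of the same queue."""
    return arrival_rate / decay_rate


def expected_rewards(initial, arrival_rate, decay_rate, t):
    """New arrivals still present plus binomially thinned initial units."""
    return (transient_mean_from_empty(arrival_rate, decay_rate, t)
            + initial * survival_probability(decay_rate, t))
```

For instance, with 10 initial units, arrival rate 2 and decay rate 0.5, the expected amount equals 10 at time zero and approaches the steady-state mean of 4 as time grows.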
Because as , we have that . Applying Markov’s inequality, we obtain that
[TABLE]
where denotes convergence in probability. Recall from Eq. (86) that, almost surely,
[TABLE]
Fix , and substitute Eq. (191) into Eq. (190). We have that
[TABLE]
where step follows from Eq. (189), and from Eqs. (193) and (194). Using the same line of arguments as that in Eqs. (190) through (195), we can show that
[TABLE]
which, along with Eq. (195), yields that
[TABLE]
Since the above equation holds for all , this proves Eq. (112) in Proposition 6.3.
We now turn to Eq. (113). Fix . Using essentially identical arguments as those for Eq. (197), we can show that
[TABLE]
We have that,
[TABLE]
Step follows from a decomposition similar to Eq. (192), by writing the recallable rewards at time as those that arrived after , which is dominated by , and those that were in the system at , which is dominated by . Step follows from Lemma 5.11. Since is a Poisson distribution with mean , it is not difficult to show that there exists a random variable , such that
[TABLE]
Combining Eqs. (112) and (200), the dominated convergence theorem implies that, for all ,
[TABLE]
This shows Eq. (113), and thus completes the proof of Proposition 6.3.
C.4 Proof of Proposition 8.1
*Proof.* Fix . Under the polynomial reward-matching model, the fluid solution satisfies
[TABLE]
Setting the left-hand side to zero, we have that a state is an invariant state of the fluid solutions if
[TABLE]
We first show that the above equations admit a unique solution, . That is, the fluid solutions admit a unique invariant state. Suppose, for the sake of contradiction, that there exist two distinct invariant states and . Let
[TABLE]
and denote the denominators on the right-hand side of Eq. (203) under and , respectively. From Eq. (203), by considering separately two cases depending on whether is smaller than , we have that the invariant state satisfies
[TABLE]
which indicates that if , then . Therefore, in order for and to be distinct, we must have that . Without loss of generality, let us assume that . Because , by Eq. (205), we have that is a monotonically decreasing function of , for all . We thus have that for all . This leads to a contradiction, since when for all , we will necessarily have that is strictly less than . This proves that the solution to Eq. (203) must be unique.
We now find the unique invariant state , and for now we assume that such exists. Note that when , is a monotonically increasing function of for . Eq. (205) implies that if and only if , which further implies that . Since is non-increasing in , we may define as the unique index such that
[TABLE]
where we define if , and if .
We now consider different values of . Suppose that . It is not difficult to verify that in this case , for all , and . It follows from Eq. (205) that we must have , or equivalently, , and that
[TABLE]
This proves Item 1 in the proposition.
Consider next the other extreme where is so small that and for all . By Eq. (205), this is to say that
[TABLE]
In this case, we have that which leads to, after rearrangement,
[TABLE]
Substituting the value of from Eq. (209) into (208), we obtain the condition on :
[TABLE]
The expression of in Eq. (124) is obtained by substituting Eq. (209) into the top line of Eq. (205). This proves Item 2 of the proposition.
Finally, fix such that . In this case, we have that . By Eq. (205), we have that for all ,
[TABLE]
where the last equality follows from the fact that , and hence . This yields
[TABLE]
which proves Eq. (130) in Item 3. It remains to identify the values of and . By Eq. (205), in order to have , it is necessary and sufficient to have
[TABLE]
where the equalities and are based on Eq. (214), and inequality uses the fact that . Analogously, in order to have , it is necessary and sufficient to have
[TABLE]
where inequality is derived from the bottom line of Eq. (214):
[TABLE]
Eqs. (218) and (222) thus give a characterization of in terms of the problem primitives, and establish Eq. (126). Note that the preceding analysis has already shown that is unique and lies in , although the uniqueness can also be derived easily by noticing that is a non-decreasing function (since whenever ) and the ’s are non-increasing in .
Finally, we show that exists and is given by the unique solution to Eq. (127). Substituting the expressions for the ’s from Eq. (214) into Eq. (203) leads to Eq. (127). In particular, is the solution to the following equation:
[TABLE]
To see that such a solution exists and is unique for any fixed , note that since , the left-hand side of the above equation is a strictly increasing function in , which grows from [math] to as varies from [math] to ; the right-hand side, on the other hand, is a strictly decreasing function in , which, as varies from [math] to , decreases from to [math]. Together, they imply that Eq. (223) must admit a unique solution in . This completes the proof of Proposition 8.1.
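The monotonicity argument above also suggests a simple way to compute the solution numerically: when one side is strictly increasing and the other strictly decreasing, their difference changes sign exactly once, so bisection applies. The following is a minimal sketch under that assumption. The function `solve_crossing` and the two example functions are hypothetical stand-ins, not the actual two sides of Eq. (223), which are not reproduced here.

```python
def solve_crossing(lhs, rhs, lo, hi, tol=1e-12, max_iter=200):
    """Find x in [lo, hi] with lhs(x) == rhs(x) by bisection.

    Assumes lhs is strictly increasing and rhs is strictly decreasing,
    so lhs - rhs is strictly increasing and has at most one root.
    """
    def gap(x):
        return lhs(x) - rhs(x)

    if gap(lo) > 0 or gap(hi) < 0:
        raise ValueError("lhs - rhs does not change sign on [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid  # the crossing lies to the right of mid
        else:
            hi = mid  # the crossing lies at or to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Illustrative functions only: an increasing power and a decreasing line.
root = solve_crossing(lambda x: x ** 2, lambda x: 1.0 - x, 0.0, 1.0)
```

Here `root` approximates the unique crossing of the two illustrative curves on the unit interval. The same routine applies to any pair of sides with the stated monotonicity.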
Appendix D Proofs for Lemmas
D.1 Proof of Lemma 5.2
*Proof.* The existence and uniqueness of the fluid solution follow from Picard’s existence theorem (Section 2, Chapter 1, Coddington and Levinson, (1955)) by verifying that the right-hand side of Eq. (15) is uniformly Lipschitz-continuous in over , and (trivially) continuous in . To show the fluid solution’s continuous dependence on initial condition, note that because of the Lipschitz continuity of , there exists a constant , such that for initial conditions and , we have that
[TABLE]
where the last inequality follows from Gronwall’s lemma (Proposition B.2). Therefore, for all . This completes the proof of the lemma.
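For reference, the Gronwall step typically takes the following form. The symbols below are notation assumed here for illustration: x(t) and y(t) denote the fluid solutions started from the two initial conditions, F the right-hand side of the fluid equation, and L its Lipschitz constant. The exact statement of Proposition B.2 may differ.

```latex
\begin{align*}
\|x(t) - y(t)\|
  &\le \|x(0) - y(0)\| + \int_0^t \|F(x(s)) - F(y(s))\| \, ds \\
  &\le \|x(0) - y(0)\| + L \int_0^t \|x(s) - y(s)\| \, ds \\
  &\le \|x(0) - y(0)\| \, e^{L t}.
\end{align*}
```

For any fixed t, the final bound vanishes as the two initial conditions approach each other, which is the continuous dependence being claimed.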
D.2 Proof of Lemma 5.11
*Proof.* We will use a simple coupling argument as follows. Fixing , the evolution of the process corresponds to that of if action were selected for all , and it follows, given the same initial condition, that is stochastically dominated by for all . Since is positive recurrent for all , we know that is also positive recurrent, which in turn implies the positive recurrence of , because the evolution of is derived by sampling from based solely on the value of . Lemma 5.11 follows from the above-mentioned stochastic dominance, the fact that under any finite initial condition, converges in distribution to as , and the observation that whenever .
D.3 Proof of Lemma 5.12
*Proof.* To show Eq. (89), observe that
[TABLE]
where step follows from Lemma 5.11 and the union bound, and from the elementary inequality . Eq. (225) thus shows that for all , for all sufficiently large , which proves Eq. (89).
D.4 Proof of Lemma 5.13
*Proof.* We first show the following uniform convergence property:
[TABLE]
where denotes a process initialized with . Suppose, for the sake of contradiction, that there exist and a sequence , , such that
[TABLE]
Because is compact, there exists a sub-sequence and such that as . We have that
[TABLE]
where step follows from the fact that, for all , is continuous with respect to (Lemma 5.2). This leads to a contradiction with Theorem 5.5, and hence proves Eq. (226).
D.5 Proof of Lemma 6.2
*Proof.* It is not difficult to verify that is a time-homogeneous, aperiodic, and irreducible Markov chain. Because the continuous-time process, , is positive recurrent, so is its discrete-time counterpart, , and converges to its steady-state distribution, , as . Eq. (108) follows from the same argument as that of Lemma 5.11. We now show Eq. (109). Denote by the index of the last update point by time :
[TABLE]
and by its value
[TABLE]
Recall that the update points are generated according to a Poisson process, and it is not difficult to show that, almost surely, and , as . We thus have that, for all ,
[TABLE]
The same argument applies for versus . This completes the proof of Lemma 6.2.
D.6 Proof of Lemma C.1
*Proof.* Define
[TABLE]
Let be the natural filtration associated with . It is not difficult to show, by the definition of , that is a martingale with respect to . Define the stopping time
[TABLE]
Let be a counting process defined as:
[TABLE]
That is, coincides with up until , and stays constant afterwards. Let
[TABLE]
Then, it is not difficult to show that is a counting process whose instantaneous rate at time is , and the process
[TABLE]
is a martingale with respect to . From the definition of and , we have that, for all ,
[TABLE]
To complete the proof, therefore, it suffices to show that
[TABLE]
We now show Eq. (238) using Doob’s inequality (Proposition B.1 in Appendix B), by following a line of arguments similar to that used in the proof of Theorem 2.2 of Kurtz, (1978). First, we introduce a representation of the Markov process using Poisson processes. Consider a family of mutually independent unit-rate Poisson counting processes, indexed by . For every , let be the rate function of for the jump with value , i.e., is the instantaneous rate at which jumps to state when in state . Then, the process can be expressed as a solution to the following integral equation:
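To make the role of the state-dependent jump rates concrete, the following is a minimal Gillespie-style simulation of a jump Markov process specified by its rate functions. It produces paths with the same law as a process driven by independent Poisson clocks, but it is not the time-change representation itself. The function names, the birth-death example, and the numerical rates are all hypothetical choices made for illustration.

```python
import random


def simulate_jump_process(x0, rates, horizon, seed=0):
    """Simulate a jump Markov process on the integers up to time `horizon`.

    `rates(x)` returns a dict {jump_value: rate} giving, for state x, the
    instantaneous rate of each possible jump. Returns a list of (time, state).
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        jump_rates = {j: r for j, r in rates(x).items() if r > 0}
        total = sum(jump_rates.values())
        if total == 0:
            break  # absorbing state: no further jumps
        t += rng.expovariate(total)  # waiting time until the next jump
        if t > horizon:
            break
        # Pick a jump with probability proportional to its rate.
        u = rng.random() * total
        acc = 0.0
        for jump, rate in jump_rates.items():
            acc += rate
            if u < acc:
                break
        x += jump
        path.append((t, x))
    return path


def birth_death_rates(x):
    """Illustrative rates: arrivals at rate 2.0, each unit decays at rate 0.5."""
    return {+1: 2.0, -1: 0.5 * x}


path = simulate_jump_process(0, birth_death_rates, horizon=50.0, seed=1)
```

In this example each unit in the state leaves independently, loosely mirroring rewards that arrive and then fade from memory. The actual rate functions of the process in the proof are not reproduced here.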
[TABLE]
In this representation, counts the number of jumps of value over the interval . We thus have that, by setting ,
[TABLE]
and
[TABLE]
Fix . Since is a martingale and a positive convex function, is a submartingale. From Eq. (241), we have that
[TABLE]
where . Note that is a stopping time with respect to . Since by definition for all , we have that . Applying the optional sampling theorem for submartingales indexed by partially ordered sets (cf. Washburn and Willsky, (1981)) to , we have that
[TABLE]
where the last inequality follows from the fact that is a Poisson random variable with mean whose moment generating function is given by , . Analogously, we can show that
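The Poisson moment generating function used in this step is the standard identity E[exp(θN)] = exp(m(exp(θ) − 1)) for a Poisson random variable N with mean m. The short check below compares the closed form against a truncated series. The symbols m and θ, the function names, and the test values are chosen here for illustration.

```python
import math


def poisson_mgf_closed_form(mean, theta):
    """E[exp(theta * N)] for N ~ Poisson(mean), via the closed-form expression."""
    return math.exp(mean * (math.exp(theta) - 1.0))


def poisson_mgf_by_summation(mean, theta, terms=100):
    """Same quantity, computed as a truncated sum over the Poisson pmf."""
    total = 0.0
    pmf = math.exp(-mean)  # P(N = 0)
    for k in range(terms):
        total += math.exp(theta * k) * pmf
        pmf *= mean / (k + 1)  # P(N = k + 1) from P(N = k)
    return total


closed = poisson_mgf_closed_form(2.0, 0.5)
summed = poisson_mgf_by_summation(2.0, 0.5)
```

For moderate values of the mean and θ the two numbers agree to high precision, which confirms the identity being invoked.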
[TABLE]
where the last inequality follows from the fact that for all .
Note that for any , both and are non-negative submartingales. Using Doob’s inequality and Eqs. (243) and (244), we have that, for all ,
[TABLE]
By setting in Eq. (246), we conclude that
[TABLE]
where . This completes the proof.
D.7 Proof of Lemma C.2
*Proof.* We show the result by induction. For the base case, we extend the definition of by letting , and it is not difficult to see that the inequality holds when .
Fix , and suppose that
[TABLE]
In light of the base case, it then suffices to show that the above equation implies
[TABLE]
Let be the natural filtration induced by , with . We have that
[TABLE]
where step follows from being -measurable. We now develop an upper bound for the second term on the right-hand side of Eq. (250), as follows.
[TABLE]
Step follows from the fact that, for a given value of , is a Bernoulli random variable with , and is independent from . For step , note that is an exponential random variable with mean , independent from and . Its moment generating function is given by for all , where , in our case, corresponds to and , for the two terms respectively. Step stems from the fact that for all . Finally, for step we have used the fact that by definition, and that ; the exponent is hence bounded from above by setting to and [math] in the numerator and denominator, respectively.
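The moment generating function of an exponential random variable invoked above is the standard one. Writing X for an exponential random variable with mean m and θ for the argument (symbols assumed here for illustration), the computation is:

```latex
\begin{align*}
\mathbb{E}\left[ e^{\theta X} \right]
  = \int_0^\infty e^{\theta x} \, \frac{1}{m} e^{-x/m} \, dx
  = \frac{1}{m} \int_0^\infty e^{-(1/m - \theta) x} \, dx
  = \frac{1}{1 - \theta m},
  \qquad \theta < \frac{1}{m}.
\end{align*}
```

The restriction θ < 1/m is what keeps the integral finite, so any application of this formula needs the argument to stay below the rate 1/m.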
Substituting Eq. (258) into Eq. (250), and invoking the induction hypothesis of Eq. (247), we have that
[TABLE]
This proves our claim.
D.8 Proof of Lemma C.3
*Proof.* Fix and . We have that
[TABLE]
where step follows from Eq. (166), from Markov’s inequality, and from being an exponentially distributed random variable with mean and hence . Because as , the claim follows.
References
- Ames, W. F. and Pachpatte, B. (1997). Inequalities for Differential and Integral Equations, volume 197. Academic Press.
- Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.
- Beggs, A. W. (2005). On the convergence of reinforcement learning. Journal of Economic Theory, 122(1):1–36.
- Benkard, C. L. (2000). Learning and forgetting: The dynamics of aircraft production. The American Economic Review, 90(4):1034–1054.
- Benveniste, A., Métivier, M., and Priouret, P. (2012). Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media.
- Besbes, O., Gur, Y., and Zeevi, A. (2015). Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244.
- Bramson, M. (1998). State space collapse with application to heavy traffic limits for multiclass queueing networks. Queueing Systems, 30(1-2):89–140.
- Coddington, E. A. and Levinson, N. (1955). Theory of Ordinary Differential Equations. Tata McGraw-Hill Education.
