Expected Sarsa($\lambda$) with Control Variate for Variance Reduction
Long Yang, Yu Zhang, Jun Wen, Qian Zheng, Pengfei Li, Gang Pan

TL;DR
This paper introduces a variance reduction technique for off-policy reinforcement learning algorithms using control variates in Expected Sarsa(λ), resulting in lower variance and improved convergence properties compared to existing methods.
Contribution
The paper proposes the ES(λ)-CV algorithm with control variates for variance reduction and extends it to GES(λ) for convergence with linear function approximation.
Findings
ES(λ)-CV has lower variance than Expected Sarsa(λ).
GES(λ) achieves a convergence rate of O(1/T).
Numerical experiments show better performance than state-of-the-art algorithms.
| Algorithm | Reference | Step-size | Convergence Rate |
|---|---|---|---|
| (?) | , | ||
| (?) | |||
| (?) | , | ||
| (?) | constant step-size | ||
| (?) | , | ||
| (?) | |||
| Ours | constant step-size |
Expected Sarsa($\lambda$) with Control Variate for Variance Reduction
Long Yang, Yu Zhang, Jun Wen, Qian Zheng, Pengfei Li, Gang Pan
Department of Computer Science, Zhejiang University
{yanglong,hzzhangyu,junwen,qianzheng,pfl,gpan}@zju.edu.cn
Abstract
Off-policy learning is powerful for reinforcement learning. However, the high variance of off-policy evaluation is a critical challenge, which causes off-policy learning to fall into uncontrolled instability. In this paper, to reduce the variance, we introduce the control variate technique to Expected Sarsa($\lambda$) and propose a tabular ES($\lambda$)-CV algorithm. We prove that if a proper estimator of the value function is available, the proposed ES($\lambda$)-CV enjoys a lower variance than Expected Sarsa($\lambda$). Furthermore, to extend ES($\lambda$)-CV to be a convergent algorithm with linear function approximation, we propose the GES($\lambda$) algorithm under the convex-concave saddle-point formulation. We prove that the convergence rate of GES($\lambda$) achieves $\mathcal{O}(1/T)$, which matches or outperforms many state-of-the-art gradient-based algorithms under a more relaxed condition. Numerical experiments show that the proposed algorithm performs better, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: GQ($\lambda$), GTB($\lambda$) and ABQ($\zeta$).
Introduction
Off-policy learning is powerful for reinforcement learning because it learns the target policy from data generated by another policy (?). However, high variance is a critical challenge for off-policy learning (?), and it is rooted in the discrepancy between the distributions of the target policy and the behavior policy. The sources of high variance in off-policy learning can be divided into two parts: (I) in the tabular case, the variance has to do with the target of the update; (II) with function approximation, it has to do with the distribution of the update (?).
In this paper, we mainly focus on variance reduction for an important off-policy algorithm: Expected Sarsa($\lambda$). We introduce a control variate to Expected Sarsa($\lambda$) and propose Expected Sarsa($\lambda$) with control variate (ES($\lambda$)-CV) for the tabular case. The control variate method is one of the most effective variance reduction techniques in statistical inference (?). A control variate is an additional term with zero expectation, so introducing it does not change the expectation of the update. Thus, learning with a control variate does not introduce any bias, but it has the potential to enjoy much lower variance (?; ?; ?). Sutton and Barto (?) (Section 12.9) first introduced a control variate to Expected Sarsa($\lambda$), but their analysis is limited to linear function approximation. Later, De Asis and Sutton (?) further introduced a control variate to multi-step TD learning, but their method is restricted to off-line learning (which is extremely expensive for training).
Despite being easy to implement, competitive with state-of-the-art methods, and used in practice, TD learning with the control variate technique lacks a robust theoretical analysis in RL. In this paper, we focus on the theoretical analysis of ES($\lambda$)-CV. We prove that the tabular ES($\lambda$)-CV converges exponentially fast for off-policy evaluation without bias. Furthermore, we analyze all the random sources that lead to the variance of ES($\lambda$)-CV, and we prove that if a proper estimator of the value function is available, ES($\lambda$)-CV enjoys a lower variance than Expected Sarsa($\lambda$).
Furthermore, we show that the variance reduction approach presented by (?) (Section 12.9) for extending ES($\lambda$)-CV with function approximation is unstable. Although this instability has been noted by Sutton and Barto (?), it is only an intuitive guess inspired by previous works (?; ?). In this paper, we provide a simple but rigorous theoretical analysis to illustrate the instability that appears in (?). We also demonstrate this instability with a typical example.
To extend ES($\lambda$)-CV with function approximation into a convergent and stable algorithm, we propose the GES($\lambda$) algorithm under the convex-concave saddle-point formulation (?). We prove that the convergence rate of GES($\lambda$) achieves $\mathcal{O}(1/T)$, where $T$ is the number of iterations. Our result matches or outperforms extensive state-of-the-art works (?; ?; ?; ?; ?; ?), with a more relaxed condition than theirs. Besides, we prove the convergence rate without assuming that the objective is strongly convex in the primal space and strongly concave in the dual space (?).
Finally, we conduct numerical experiments to show that the proposed algorithm is stable and converges faster with lower variance than several state-of-the-art gradient-based TD learning algorithms: GQ($\lambda$) (?), GTB($\lambda$) (?), and ABQ($\zeta$) (?).
Contributions
- We introduce the control variate technique to Expected Sarsa($\lambda$) and propose a tabular ES($\lambda$)-CV algorithm. We prove that if a proper estimator of the value function is available, the proposed ES($\lambda$)-CV enjoys a lower variance than Expected Sarsa($\lambda$).
- We propose GES($\lambda$), which extends ES($\lambda$)-CV to be a convergent algorithm with linear function approximation. We prove that the convergence rate of GES($\lambda$) achieves $\mathcal{O}(1/T)$, which matches or outperforms many state-of-the-art gradient-based algorithms under a more relaxed condition.
Preliminary and Some Notations
In this section, we introduce the necessary notation for reinforcement learning, temporal difference learning, and the $\lambda$-return. Due to space limitations, more discussion of the $\lambda$-return is provided in Appendices A and B.
Reinforcement Learning Reinforcement learning (RL) is often formalized as a Markov decision process (MDP) (?), described by a 5-tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},P,R,\gamma)$. $\mathcal{S}$ is the set of all states and $\mathcal{A}$ is the set of all actions. $P$ gives the probability of the state transition from $s$ to $s'$ under action $a$, $R$ is the reward function, and $\gamma$ is the discount factor.
A policy $\pi$ is a probability distribution on $\mathcal{S}\times\mathcal{A}$. The target policy $\pi$ is the policy to be learned, and the behavior policy $\mu$ is used to generate behavior. $\tau$ denotes a trajectory. For a given policy $\pi$, $q^{\pi}$ denotes its state-action value function and $v^{\pi}$ its state value function, where $\mathbb{E}_{\pi}[\cdot|\cdot]$ denotes a conditional expectation over all actions selected according to $\pi$. It is known that $q^{\pi}$ is the unique fixed point (?) of the Bellman operator $\mathcal{B}^{\pi}$,
[TABLE]
which is known as Bellman equation, where
[TABLE]
P^{\pi}$$\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|} and R$$\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}, the corresponding elements of and are:
[TABLE]
TD Learning Temporal difference (TD) learning (?) is one of the most important methods to solve model-free RL (in which, we cannot get ). For the trajectory , TD learning is defined as,
[TABLE]
where is an estimate of , is step-size and is TD error. Let , if is
[TABLE]
the above update (2) is the Sarsa algorithm (?). If the TD error is
[TABLE]
update (2) is Expected Sarsa (?), where . If $\pi$ is reduced to the greedy policy, then Expected Sarsa reduces to Q-learning (?).
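To make update (2) with the Expected Sarsa TD error concrete, here is a minimal tabular sketch. The function name `expected_sarsa_step`, the array layout (`Q[s, a]`, `pi[s, a]`), and the toy numbers are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def expected_sarsa_step(Q, pi, s, a, r, s_next, alpha, gamma, terminal=False):
    """One tabular Expected Sarsa update (hypothetical helper); returns the TD error."""
    # Expected next value under the target policy pi (zero at terminal states).
    v_next = 0.0 if terminal else float(np.dot(pi[s_next], Q[s_next]))
    delta = r + gamma * v_next - Q[s, a]   # expected TD error
    Q[s, a] += alpha * delta               # TD update with step-size alpha
    return delta

# Toy problem: 2 states, 2 actions (illustrative numbers only).
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]
pi = np.array([[0.5, 0.5], [0.25, 0.75]])
delta = expected_sarsa_step(Q, pi, s=0, a=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
```

If `pi` puts all its mass on the greedy action, `v_next` becomes the maximum of `Q[s_next]` and the same function performs a Q-learning update.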
Expected Sarsa The standard forward view of the $\lambda$-return (?) of on-policy Expected Sarsa is defined as follows,
[TABLE]
where is -step return of , and . We can write recursively as follows (the detail is provided in Appendix A),
[TABLE]
How should the $\lambda$-return of Expected Sarsa be defined for off-policy learning? Can we follow (4) straightforwardly? Unfortunately, in the off-policy case this idea does not converge to $q^{\pi}$. In fact, the $n$-step return of Expected Sarsa is sampled according to
Then according to (4), we define the -return of as follows,
which converges to . This is the fixed point of and it is a biased estimate of $q^{\pi}$.
Now, we introduce an unbiased recursive $\lambda$-return of Expected Sarsa for off-policy learning,
[TABLE]
where $\rho$ denotes the importance sampling ratio. Eq.(6) first appears in (?; ?), where it is limited to function approximation. We develop (6) into a general version, which is conducive to the theoretical analysis in the following sections. Proposition 1 below shows that (6) is an unbiased estimate of $q^{\pi}$.
Proposition 1.
Let $\mu$ and $\pi$ be the behavior and target policy, respectively. For the $\lambda$-return (6), we have
[TABLE]
Due to space limitations, more discussion of the $\lambda$-return of Sarsa, Eqs.(5)-(6), and the proof of Proposition 1 are provided in Appendices A and B.
Expected Sarsa($\lambda$) with Control Variate
In this section, we first define Expected Sarsa($\lambda$) with control variate (ES($\lambda$)-CV for short). Then, we prove the linear convergence rate of ES($\lambda$)-CV for policy evaluation. Finally, we analyze the variance of ES($\lambda$)-CV.
ES($\lambda$)-CV Algorithm
We define Expected Sarsa($\lambda$) with control variate as follows
[TABLE]
where the additional term is called control variate (CV). The following fact
[TABLE]
implies that (7) extends (6) without introducing biases.
Theorem 1 (Forward View of ES($\lambda$)-CV).
Let denote the cumulated importance sampling from time to , and we use for convention. The recursive -return in Eq.(7) is equivalent to the following forward view: let be the TD error defined in (3), ,
[TABLE]
Proof.
See Appendix C. ∎
Remark 1.
Eq.(8) illustrates that for a given finite-horizon trajectory , the total update (7) reaches
[TABLE]
which is the off-line update of ES($\lambda$)-CV.
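The equivalence between a recursive return with control variate and a forward view built from importance-weighted TD errors can be checked numerically. The sketch below assumes the recursion takes the form used in Sutton and Barto's Section 12.9, $G_t = R_{t+1} + \gamma[(1-\lambda)\bar{V}_{t+1} + \lambda(\rho_{t+1}G_{t+1} + \bar{V}_{t+1} - \rho_{t+1}Q_{t+1})]$ with $\bar{V}_{t+1}=\mathbb{E}_{\pi}Q(S_{t+1},\cdot)$; this form, the function names, and the toy data are assumptions, since the displayed equations (7) and (8) are not reproduced in this text.

```python
import numpy as np

def recursive_return(Q, pi, mu, traj, gamma, lam):
    """Backward recursion for the lambda-return with control variate (assumed form)."""
    g_next = 0.0
    for s, a, r, s_next, a_next in reversed(traj):
        if s_next is None:                       # terminal transition
            g = r
        else:
            rho = pi[s_next, a_next] / mu[s_next, a_next]
            v_bar = float(pi[s_next] @ Q[s_next])
            cv = v_bar - rho * Q[s_next, a_next]  # zero-mean control variate
            g = r + gamma * ((1 - lam) * v_bar + lam * (rho * g_next + cv))
        g_next = g
    return g_next

def forward_return(Q, pi, mu, traj, gamma, lam):
    """Q(S_0, A_0) plus discounted, importance-weighted expected TD errors."""
    total, weight = Q[traj[0][0], traj[0][1]], 1.0
    for s, a, r, s_next, a_next in traj:
        v_bar = 0.0 if s_next is None else float(pi[s_next] @ Q[s_next])
        total += weight * (r + gamma * v_bar - Q[s, a])
        if s_next is not None:
            weight *= gamma * lam * pi[s_next, a_next] / mu[s_next, a_next]
    return total

# Illustrative data: 3 states, 2 actions, one finite trajectory of
# (state, action, reward, next_state, next_action) tuples.
Q = np.array([[0.5, -1.0], [2.0, 0.3], [-0.7, 1.2]])
pi = np.array([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]])
mu = np.array([[0.5, 0.5], [0.5, 0.5], [0.2, 0.8]])
traj = [(0, 1, 1.0, 1, 0), (1, 0, -0.5, 2, 1), (2, 1, 2.0, None, None)]

g_rec = recursive_return(Q, pi, mu, traj, gamma=0.9, lam=0.7)
g_fwd = forward_return(Q, pi, mu, traj, gamma=0.9, lam=0.7)
```

Under the assumed recursion the two values agree up to floating-point error, because $G_t - Q(S_t,A_t)$ unrolls into the sum of TD errors weighted by the products of $\gamma\lambda\rho$.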
Policy Evaluation
For policy evaluation, our goal is to estimate $q^{\pi}$ from the trajectory collection , where , , and strictly depend on the trajectory index, which we omit to simplify the expressions without ambiguity.
The following $\lambda$-operator is a high-level view of ES($\lambda$)-CV (8), and it helps us introduce the policy evaluation algorithm.
[TABLE]
where is defined in Eq.(1). We provide the equivalence (a) in Appendix D.
Theorem 2 (Policy Evaluation).
For any initial , consider the trajectory generated by , and the following is generated according to the -th trajectory , ,
[TABLE]
By iterating over trajectories, the error of policy evaluation is upper bounded by
[TABLE]
Proof.
See Appendix E. ∎
Remark 2.
The forward view (off-line update) of ES($\lambda$)-CV (8) can be seen as sampled according to . For any , then , thus Eq.(13) implies that (8) converges to $q^{\pi}$ at a linear convergence rate.
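The linear rate can be illustrated with an expected (model-based) version of the iteration. The sketch below assumes the $\lambda$-operator has the standard form $q + (I-\gamma\lambda P^{\pi})^{-1}(\mathcal{B}^{\pi}q - q)$, whose sup-norm contraction factor is $\gamma(1-\lambda)/(1-\gamma\lambda)$; both the operator form and the factor are assumptions about what Eqs.(11) and (13) contain, and the random transition matrix is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, lam = 6, 0.9, 0.5           # n = number of state-action pairs

# Random row-stochastic matrix standing in for the transition matrix under pi.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)

# True value function: fixed point of the Bellman operator B q = R + gamma P q.
q_pi = np.linalg.solve(np.eye(n) - gamma * P, R)

def lambda_operator(q):
    """Assumed lambda-operator: q + (I - gamma*lam*P)^{-1} (B q - q)."""
    bellman_residual = R + gamma * P @ q - q
    return q + np.linalg.solve(np.eye(n) - gamma * lam * P, bellman_residual)

rate = gamma * (1 - lam) / (1 - gamma * lam)   # sup-norm contraction factor
q = np.zeros(n)
errors = [np.max(np.abs(q - q_pi))]
for _ in range(10):
    q = lambda_operator(q)
    errors.append(np.max(np.abs(q - q_pi)))
```

Each application shrinks the sup-norm error by at least the factor `rate`, so the error after $K$ iterations is at most `rate**K` times the initial error.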
Variance Analysis
Theorem 3 (Variance Analysis of ES($\lambda$)-CV).
Consider a single trajectory with finite horizon $H$, let , $\mathbb{V}\mathrm{ar}\big[\widetilde{G}_{H+1}^{\lambda\rho,\mathrm{ES}}\big]=0$. The variance of $\widetilde{G}_{t}^{\lambda\rho,\mathrm{ES}}$ is given recursively as follows,
[TABLE]
where .
Proof.
See Appendix F. ∎
Now, let us illustrate the significance of Eq.(14).
(I) It identifies all the random sources that lead to the variance. The first three terms reveal that the variance of the return is caused by the following factors, respectively: the one-step error of policy evaluation, the error between the estimate and the true value $q^{\pi}$, and the randomness of state-action transitions. The last term in (14) is the variance of future time steps.
(II) Notice that if the CV term (in ) vanishes, i.e. , Eq.(14) reduces to the recursive variance of (6). Thus, by Eq.(14), comparing the variance of with is equivalent to comparing the variance of .
Furthermore, if a good estimator of $q^{\pi}$ is available, the following two events happen:
1. For ES($\lambda$)-CV, the term . Since for a proper estimate of $q^{\pi}$, the following happens
[TABLE]
2. For Expected Sarsa($\lambda$), in contrast, , which never reduces to zero, no matter how good an estimate of $q^{\pi}$ we achieve.
Thus, if a good estimator of $q^{\pi}$ is available, we have,
[TABLE]
Thus the ES($\lambda$)-CV return enjoys a lower variance than the Expected Sarsa($\lambda$) return.
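A one-step toy calculation makes the comparison tangible. The sketch below assumes the control variate has the form $\mathbb{E}_{\pi}Q(S',\cdot)-\rho\,Q(S',A')$ and that the bootstrapped return at the next step equals an exact estimate $Q(S',A')$; the policies and values are made-up numbers, and expectations are computed exactly by enumerating the actions drawn from $\mu$.

```python
import numpy as np

# Hypothetical next-state quantities (illustrative numbers only).
pi = np.array([0.7, 0.2, 0.1])   # target policy pi(.|s')
mu = np.array([0.2, 0.3, 0.5])   # behavior policy mu(.|s'), covers pi
q = np.array([1.0, -2.0, 4.0])   # estimate Q(s', .), assumed equal to q^pi(s', .)

rho = pi / mu                    # importance sampling ratios
v_bar = float(pi @ q)            # E_pi[Q(s', .)]

def variance(values, probs):
    """Exact variance of a discrete random variable."""
    mean = float(probs @ values)
    return float(probs @ (values - mean) ** 2)

plain = rho * q                          # importance-weighted term without CV
with_cv = rho * q + (v_bar - rho * q)    # same term plus the control variate

mean_plain, mean_cv = float(mu @ plain), float(mu @ with_cv)
var_plain, var_cv = variance(plain, mu), variance(with_cv, mu)
```

Both terms have the same mean under $\mu$, so the control variate adds no bias, while in this toy setting the variance with the control variate collapses to zero and the plain importance-weighted term keeps a strictly positive variance.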
Numerical Analysis
We use an experiment to verify that the CV effectively reduces the variance of Expected Sarsa($\lambda$) on an off-policy evaluation task. In this experiment, the target policy is the greedy policy, the value of is selected by with an $\epsilon$-greedy policy, where $\epsilon$ is decayed as , . After 150 episodes, , and the value of the target policy comes around . We use an $\epsilon$-greedy policy as the behavior policy $\mu$. All algorithms use step-size and .
Gradient Expected Sarsa($\lambda$)
In this section, we extend ES($\lambda$)-CV with linear function approximation. First, we prove that the way of extending ES($\lambda$)-CV with function approximation in (?) (Section 12.9) is unstable. Then, we propose a convergent gradient Expected Sarsa($\lambda$) algorithm.
The Bellman equation (1) cannot be solved directly by tabular method for a large dimension of . We often use a parametric function to approximate where is a feature map. Then can be rewritten as a version of matrix where is a matrix whose row is . We assume that Markov chain induced by behavior policy is ergodic (?), i.e. there exists a stationary distribution such that , We denote as a diagonal matrix whose diagonal element is .
Instability of ES($\lambda$) with Function Approximation
A typical update to extend (8) has been presented in (?) (section 12.9),
[TABLE]
where is step-size, , is short for . Once the system (15) has reached a stable state, for any , the expected parameter can be written as
[TABLE]
where
[TABLE]
If the system (16) converges, then converges to the TD fixed point that satisfies
What condition guarantees the convergence of (15)/(16)? The instability of (15) for off-policy learning was first noted by Sutton and Barto (?), but only as an intuitive guess inspired by previous works. We now provide a simple but rigorous theoretical analysis to illustrate the divergence of Eq.(15). It is known that for on-policy learning , is a negative definite matrix (?). Thus, for on-policy learning, (15) converges to . However, for off-policy learning, the steady state-action distribution does not match the transition probability and , so there is no guarantee that is a negative definite matrix (?). Thus (15) may diverge.
An Unstable Example Now, we use a typical example (?) to illustrate the instability of iteration (15). The state transition of the example is presented in Figure 2. After some simple algebra (the detail is provided in Appendix G), we have . For any , a positive constant step-size , according to (16), we have
[TABLE]
For any , , is a positive scalar. Since then cannot be a negative definite matrix. Furthermore, according to (20),
[TABLE]
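The mechanism behind this kind of divergence can be sketched with the classic two-state "w, 2w" off-policy example from Sutton and Barto's textbook. It is a different and simpler example than the one in Figure 2, chosen here only because its key quantity is a scalar; the variable names and numbers are assumptions for illustration.

```python
phi_1, phi_2 = 1.0, 2.0     # scalar features of the two states
gamma, alpha = 0.9, 0.1     # discount factor and constant step-size

# Expected semi-gradient TD update when only the transition
# state 1 -> state 2 (reward 0) is sampled by the behavior policy:
#   theta <- theta + alpha * A * theta,  A = phi_1 * (gamma * phi_2 - phi_1).
A = phi_1 * (gamma * phi_2 - phi_1)

theta = 1.0
history = [theta]
for _ in range(50):
    theta += alpha * A * theta
    history.append(theta)
```

Because `A` is positive whenever `gamma > 0.5`, every step multiplies `theta` by `1 + alpha * A > 1`, so the iterate grows geometrically for any positive step-size. This is the scalar analogue of the key matrix failing to be negative definite.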
Convergent Algorithm
The above discussion of the instability of off-policy learning shows that we should abandon the approach presented in (15). In this section, we propose a convergent gradient Expected Sarsa($\lambda$) (GES($\lambda$)) algorithm.
We solve the problem by minimizing the mean square projected Bellman error (MSPBE) (?),
[TABLE]
where is a projection matrix. Furthermore, MSPBE can be rewritten as,
[TABLE]
where . The derivation of (22) is provided in Appendix H.
The computational complexity of inverting the matrix is at least (?), where is the dimension of the feature space. Thus, it is too expensive to use gradient updates to solve problem (22) directly. Besides, as pointed out in (?; ?), we cannot get an unbiased estimate of . In fact, since the gradient update involves a product of expectations, an unbiased estimate cannot be obtained from a single sample. It requires sampling twice, which is known as the double sampling problem. Second, also cannot be estimated from a single sample, which is the second bottleneck in applying stochastic gradient methods to problem (22).
A practical approach is to convert (22) into a convex-concave saddle-point problem (?). For , its convex conjugate (?) function is defined as
[TABLE]
By , we have Thus, (22) is equivalent to the next convex-concave saddle-point problem
[TABLE]
It is easy to see that if is the solution of problem (23), then . In fact, let , then . Substituting into (23) reduces (23) to , which shows that the solution of (22) is contained in (23). Gradient updates are a natural way to solve problem (23) (ascending in $\omega$ and descending in $\theta$) as follows,
[TABLE]
where is step-size, .
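A small deterministic sketch shows how simultaneous gradient ascent in the dual variable and descent in the primal variable solves a saddle-point problem of this type. The sign convention used here, $\min_{\theta}\max_{\omega}\langle b-A\theta,\omega\rangle-\frac{1}{2}\omega^{\top}M\omega$, together with the matrices, step-sizes, and iteration count, are assumptions for illustration and may differ from the convention in (23)-(25).

```python
import numpy as np

# Hypothetical problem data. Maximizing over omega gives
# 0.5 * ||b - A theta||^2 in the M^{-1} norm, which is minimized where A theta = b.
A = np.array([[2.0, 1.0], [0.0, 1.5]])
b = np.array([1.0, -1.0])
M = np.eye(2)

alpha = beta = 0.05          # constant step-sizes
theta = np.zeros(2)          # primal variable
omega = np.zeros(2)          # dual variable
for _ in range(5000):
    grad_omega = b - A @ theta - M @ omega   # gradient with respect to omega
    grad_theta = -A.T @ omega                # gradient with respect to theta
    omega = omega + beta * grad_omega        # ascend in omega
    theta = theta - alpha * grad_theta       # descend in theta

theta_star = np.linalg.solve(A, b)           # reference solution of A theta = b
```

With these constant step-sizes the iterates settle at `theta_star` and `omega` returns to zero, without any matrix inversion inside the loop; `np.linalg.solve` is used only to obtain a reference solution for comparison.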
Stochastic On-line Implementation However, since , and are expectations, in model-free RL we cannot get the transition probabilities. A practical approach is to find unbiased estimators of them. Let . By Theorem 9 in (?), we have
[TABLE]
Replacing the expectations in (24) and (25) by corresponding unbiased estimates, we define the stochastic on-line implementation of (24) and (25) as follows,
[TABLE]
More details are summarized in Algorithm 1.
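For intuition, one sample-based primal-dual step with an eligibility trace might look as follows. This is only a sketch in the spirit of the stochastic updates described above: the estimator forms (trace $e_t=\lambda\gamma\rho_t e_{t-1}+\phi_t$, expected TD error, and $\phi_t\phi_t^{\top}$ for the dual quadratic term), the function name `primal_dual_td_step`, and all argument names are assumptions, not the paper's Eq.(27) or Algorithm 1.

```python
import numpy as np

def primal_dual_td_step(theta, omega, e_prev, phi, phi_bar_next, r, rho,
                        alpha, beta, gamma, lam):
    """One hypothetical stochastic saddle-point step with an eligibility trace.

    phi          : feature vector of the current state-action pair
    phi_bar_next : expected next feature under the target policy
    rho          : importance sampling ratio for the current action
    """
    e = lam * gamma * rho * e_prev + phi                        # eligibility trace
    delta = r + gamma * theta @ phi_bar_next - theta @ phi      # expected TD error
    omega_new = omega + beta * (delta * e - phi * (phi @ omega))            # ascend
    theta_new = theta + alpha * (phi - gamma * phi_bar_next) * (e @ omega)  # descend
    return theta_new, omega_new, e

theta, omega, e = primal_dual_td_step(
    theta=np.zeros(2), omega=np.array([1.0, 0.0]), e_prev=np.zeros(2),
    phi=np.array([1.0, 0.0]), phi_bar_next=np.array([0.0, 1.0]),
    r=2.0, rho=1.0, alpha=0.1, beta=0.5, gamma=0.9, lam=0.5)
```

Both variables are updated from the same sample and from the previous values of `theta` and `omega`, so no second independent sample of the next state is needed.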
Convergence Analysis
We measure the convergence rate of problem (23) by primal-dual gap error (?). Let
[TABLE]
the primal-dual gap error at each solution is
[TABLE]
Theorem 4 (Convergence of Algorithm 1).
Consider the sequence generated by (27), where the step-sizes are positive constants. Let be the optimal solution of (23), , and choose the step-sizes to satisfy , where is the operator norm. If the parameter is on a bounded set, i.e. $\mathrm{diam}\,D_{\theta}<\infty$, $\mathrm{diam}\,D_{\omega}<\infty$, then is upper bounded by:
[TABLE]
Proof.
See Appendix I. ∎
Remark 3.
Theorem 4 illustrates (I) when , then the overall convergence rate of is , which reaches the worst rate of black box oriented sub-gradient methods (?); (II) when , a positive scalar, then
Related Works and Comparison
Liu et al. (?) first derive via the convex-concave saddle-point formulation, and they prove that the convergence rate reaches , where is the Polyak average: , . Their requires each to be projected into the space . Later, Wang et al. (?) extend the work of Liu et al. (?); they suppose the data is generated from Markov processes rather than under the i.i.d. assumption. Wang et al. (?) prove the convergence rate , and the best convergence rate reaches , where the step-size satisfies , and is also the Polyak average, the same as in (?). Besides, the of Wang et al. (?) also requires projecting the parameter into the space .
Both Polyak-averaging and projection make the implementation of gradient TD learning more difficult. Compared with (?; ?), our GES($\lambda$) removes Polyak-averaging and projection, while reaching a faster convergence rate.
Recently, (?) prove that the family (?; ?) converges at , but never reaches , where . Nathaniel and Prashanth (?) prove that (?) converges at with step-size , where . Then, Dalal et al. (?) further explore the property of , and they prove that the convergence rate achieves , but never reaches , where , is the minimum eigenvalue of the matrix .
Compared with all the above works, we improve the optimal convergence rate to $\mathcal{O}(1/T)$ with a more relaxed step-size than theirs. Besides, although the / (?) reaches the same convergence rate as ours, their result depends on a decaying step-size.
More details of the convergence rate of gradient temporal difference learning are summarized in Table 1.
Experiments
In this section, we employ three typical domains to test the capacity of GES($\lambda$) for off-policy evaluation: Mountaincar, Baird Star (?), and Two-state MDP (?). We compare GES($\lambda$) with three state-of-the-art algorithms: GQ($\lambda$) (?), GTB($\lambda$) (?), and ABQ($\zeta$) (?). We choose these three methods as baselines because they all learn by the expected TD error , which is the same as GES($\lambda$). Due to space limitations, we present some details of the experiments in Appendix J.
The Effect of Step-size
In this section, we verify the convergence result presented in Theorem 4/Remark 3. We use the empirical
[TABLE]
to evaluate the performance of all the algorithms, where we evaluate , , and according to their unbiased estimates by the Monte Carlo method with 5000 episodes. In particular, for Mountaincar, to collect the samples, we run with features to obtain a stable policy. Then, we use this policy to collect trajectories that comprise the samples.
Figure 3 shows the comparison of the empirical MSPBE performance between a constant step-size and the decaying step-size . The result (in Figure 3) illustrates that GES($\lambda$) with a proper constant step-size converges significantly faster than learning with step-size , which supports our theoretical analysis in Remark 3.
Comparison of Empirical MSPBE
The MSPBE distribution is computed over the combination of step-size, , and we set , for . All results shown in Figure 4 are averaged over 100 runs.
The result in Figure 4 shows that our GES($\lambda$) learns significantly faster with better performance than , and in all domains. Besides, GES($\lambda$) converges with a lower variance. We also notice that although Touati et al. (?) claim that their reaches the same convergence rate as our GES($\lambda$), the result in Figure 4 shows that our GES($\lambda$) outperforms theirs significantly.
Comparison of Empirical MSE
We use the following empirical MSE according to (?),
[TABLE]
where is estimated by simulating the target policy and averaging the discounted cumulative rewards over trajectories. The combination of step-sizes for MSE is the same as for the previous empirical MSPBE. All results shown in Figure 5 are averaged over 100 runs and we set , for .
The result in Figure 5 shows that GES($\lambda$) converges significantly faster than all three baselines with lower variance in the Mountaincar domain. For the Two-state MDP and Baird domains, GES($\lambda$) also achieves better performance. This conclusion further verifies the effectiveness of the proposed GES($\lambda$).
Conclusion
In this paper, we introduce the control variate technique to Expected Sarsa($\lambda$) and propose the ES($\lambda$)-CV algorithm. We analyze all the random sources that lead to the variance of ES($\lambda$)-CV. We prove that if a good estimator of the value function is available, ES($\lambda$)-CV enjoys a lower variance than Expected Sarsa($\lambda$) without control variate. Then, we extend ES($\lambda$)-CV to be a convergent algorithm with function approximation and propose the GES($\lambda$) algorithm. We prove that the convergence rate of GES($\lambda$) achieves $\mathcal{O}(1/T)$, which matches or outperforms several state-of-the-art gradient-based algorithms, while using a more relaxed step-size. Finally, we use numerical experiments to demonstrate the effectiveness of the proposed algorithm. Results show that the proposed algorithm converges faster and with lower variance than three typical algorithms: GQ($\lambda$), GTB($\lambda$) and ABQ($\zeta$).
Appendix A: $\lambda$-Return of Sarsa for Off-policy Learning
For the discussion of off-policy learning, we need background on importance sampling. Thus, we first recall the basic results on importance sampling (IS) and per-decision importance sampling (PDIS) (?).
Off-Policy Learning via Importance Sampling
Usually, we require that every action taken by $\pi$ is also taken by $\mu$, which is often called coverage (?) in reinforcement learning.
Assumption 1 (Coverage).
, we require that .
The difficulty of off-policy learning is rooted in the discrepancy between the target policy $\pi$ and the behavior policy $\mu$: we want to learn the target policy while we only get data generated by the behavior policy. One technique to handle this discrepancy is importance sampling (IS) (?). Let be a trajectory with finite horizon . Let denote the cumulated importance sampling ratio, where and . Let , under Assumption 1 the IS estimator is an unbiased estimate of . However, it is known that the IS estimator suffers from the large variance of the product (?). Per-decision importance sampling (PDIS) (?) is a practical variance reduction method that does not introduce bias, i.e. .
[TABLE]
For the equation , please see (?) or Section 5.9 in (?).
Lemma 1 (Section 3.10, (?); Section 5.9, (?)).
Let be the trajectory generated by behavior policy , for a given policy and under Assumption 1, the following holds,
[TABLE]
Lemma 1 implies that for any time , the importance sampling factors after have no effect in the expectation, thus the following holds: for all ,
[TABLE]
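The unbiasedness of both estimators can be verified exactly on a tiny problem by enumerating every action sequence. The sketch below uses a stateless, three-step toy problem in which the reward depends only on the time step and the action; this setup, the policies, and the variable names are assumptions for illustration.

```python
import itertools
import numpy as np

gamma = 0.9
pi = np.array([0.8, 0.2])          # target policy over two actions
mu = np.array([0.5, 0.5])          # behavior policy (covers pi)
rewards = np.array([[1.0, 0.0],    # rewards[t, a]: reward at step t for action a
                    [0.0, 2.0],
                    [3.0, -1.0]])
H = rewards.shape[0]

is_mean = pdis_mean = 0.0
for actions in itertools.product(range(2), repeat=H):
    idx = list(actions)
    prob_mu = np.prod(mu[idx])                 # trajectory probability under mu
    ratios = np.cumprod(pi[idx] / mu[idx])     # cumulated ratios up to each step
    disc_rewards = np.array([gamma**t * rewards[t, a] for t, a in enumerate(actions)])
    is_mean += prob_mu * ratios[-1] * disc_rewards.sum()   # ordinary IS estimator
    pdis_mean += prob_mu * float(ratios @ disc_rewards)    # per-decision IS estimator

# Expected discounted return when actions are drawn from the target policy.
true_value = sum(gamma**t * float(pi @ rewards[t]) for t in range(H))
```

The per-decision estimator weights each reward only by the ratios up to its own time step, which is exactly the point of Lemma 1: ratios after that step have expectation one and can be dropped without changing the mean.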
$\lambda$-Return of Sarsa
The $\lambda$-return (?) is an average of all the $n$-step returns, each weighted proportionally to $\lambda^{n-1}$, . For example, let be the $n$-step return, then the standard forward view of Sarsa($\lambda$) is , which is equivalent to the following recursive version
[TABLE]
We only discuss the case of off-policy learning. On-policy learning is a particular case of off-policy learning if $\mu=\pi$. One version of the $\lambda$-return of off-policy Sarsa via importance sampling is defined as the following recursive iteration (Section 12.8, (?)):
[TABLE]
Proposition 2 below gives a forward view of Eq.(30) and shows that is an unbiased estimate of .
Proposition 2.
Let be behavior policy and be the target policy. , , and , then is equivalent to defined in Eq.(30). Furthermore, .
Proof.
We restate the complete calculation of the off-policy $\lambda$-return as follows
[TABLE]
The last Eq.(33) implies that from the definition of standard -return Eq.(31) and Eq.(32), we can get the recursive form of Eq.(30).
Expanding Eq.(31), we get the complete -step return as follows
[TABLE]
By Eq.(28) and Eq.(29), we have
[TABLE]
thus, ∎
Appendix B: Proof of Eq.(5) and Proposition 1
Eq.(5): Recursive $\lambda$-Return of Expected Sarsa for On-policy Case
In this section, we prove (I) the forward view of Eq.(5); (II) Eq.(5) is an unbiased estimate of .
Let where is the $n$-step return of Expected Sarsa and , then can be written recursively as: Besides,
Proof.
By the definition of -step return of Expected Sarsa: , then can be written as the following recursive form:
[TABLE]
Now, we turn to analyze :
[TABLE]
which is the result in Eq.(5).
For on-policy learning, the following is obvious
[TABLE]
Similarly to Eq.(35), we have
[TABLE]
which implies is an unbiased estimate of . ∎
Proof of Proposition 1
Proposition 1. Let $\mu$ and $\pi$ be the behavior and target policy, respectively. Consider the $\lambda$-return of Sarsa and Eq.(6), then
Proof.
We expand as follows
[TABLE]
where Eq.(41) holds by the following facts: recall , thus
[TABLE]
If we continue to expand Eq.(42), then we have
[TABLE]
∎
Appendix C: Proof of Theorem 1
Theorem 1 (Forward View and Variance Analysis of Expected Sarsa with Control Variate). Let $\mu$ and $\pi$ denote the behavior and target policy, respectively. The $\lambda$-return with control variate defined in Eq.(7) is equivalent to the following forward view: let ,
[TABLE]
Proof.
First, we prove that Eqs.(43)-(44) are equivalent to Eq.(7). Let us expand (in Eq.(44)),
[TABLE]
the last Eq.(46) implies
[TABLE]
which is Eq.(7). ∎
Appendix D: Proof of Eq.(11)
The Equivalence (a) for Eq.(11)
Proof.
[TABLE]
Eq.(48) is a common result in RL; for details, please refer to (?) or Section 6.3.9 in (?). ∎
Appendix E: Proof of Theorem 2
Theorem 2 (Policy Evaluation). For any initial , consider the sequential trajectory collection , and the following is learned according to the -th trajectory , ,
[TABLE]
By iterating over trajectories, the error of policy evaluation is upper bounded by
[TABLE]
Proof.
(Proof of Theorem 2) By Eq.(11), the following equation holds (?; ?),
[TABLE]
It is known that the Bellman operator $\mathcal{B}^{\pi}$ is a $\gamma$-contraction (?),
[TABLE]
Thus we have
[TABLE]
Since , Eq.(50) implies that is a -contraction. By Banach fixed point theorem (?), generated by converges to the fixed point of .
By Eq.(11), is the unique fixed point of . Thus, converges to .
Now, we turn to consider the convergence rate. According to (50), it is easy to see , Then, ,
[TABLE]
let , we have
[TABLE]
∎
Appendix F: Proof of Theorem 3
Theorem 3. is an unbiased estimator of , whose variance is given recursively as follows,
[TABLE]
where , .
Lemma 2.
The expectation of the cross-term between the TD error at and the difference between the return and value at is zero: for any , i.e., satisfying the Bellman equation, for any bounded function ,
[TABLE]
A similar result for the state value function appears in (?), and Lemma 2 extends it to the state-action value function. Thus, we omit its proof; for details, please refer to (?).
Remark 4.
If is replaced by Expected Sarsa estimator , Eq.(51) holds.
Proof.
(Proof of Theorem 3)
[TABLE]
Eq.(52) holds due to Remark 4 and Lemma 1 in (?). By the definition of variance, Eq.(52) is equivalent to Eq.(14), which is the result we want to prove. ∎
Appendix G: Two-State MDP Example
[TABLE]
then, we have
[TABLE]
[TABLE]
Appendix H: Proof of Eq.(22)
For a given policy , , then by the definition of the MSPBE objective function, we have,
[TABLE]
where
Appendix I: Proof of Theorem 4
Theorem 4. Consider the sequence generated by (27), where the step-sizes are positive constants. Let , and choose the step-sizes to satisfy , where is the operator norm. If the parameter is on a bounded set, i.e. $\mathrm{diam}\,D_{\theta}<\infty$, $\mathrm{diam}\,D_{\omega}<\infty$, then is upper bounded by:
[TABLE]
The proof of Theorem 4 uses an inequality (Eq.(55)), which we present in Proposition 3 below.
Proposition 3.
Consider the update of the expectation version in Eq.(25),
[TABLE]
Let , then for any , the following holds
[TABLE]
Proof.
(Proof of Proposition 3) Let sub-gradients of at be denoted as , . By the definition of sub-gradient , we have Since is convex, then for any the following holds
[TABLE]
By the law of cosines: , we have
[TABLE]
summing them implies the following inequality,
[TABLE]
which is what we want to prove. ∎
Proof.
(Proof of Theorem 4) Let ,. then for any :
[TABLE]
By the inequality in Proposition 3, we have
[TABLE]
Summing the Eq.(56) from
[TABLE]
By the Cauchy-Schwarz inequality , we have
[TABLE]
then the following holds, for any :
[TABLE]
Let , and choose the step-sizes to satisfy . By the convexity of and , we deduce from (57):
[TABLE]
By Eq.(58), we have
[TABLE]
∎
Appendix J: Details of Experiments
MountainCar Since the state space of the Mountaincar domain is continuous, we use the open tile coding software http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/tilecoding.html to extract features of states.
In this experiment, we set the number of tilings to 4 and there are no white noise features. The performance is an average of 5 runs and each run contains 5000 episodes. We set , . The MSPBE/MSE distribution is computed over the combination of step-size, , and . Following suggestions from Section 10.1 in (?), we set all the initial state-action values to 0, which is optimistic and causes extensive exploration.
Baird Example The Baird example considers an episodic seven-state, two-action MDP. The dashed action takes the system to one of the six upper states with equal probability, whereas the solid action takes the system to the seventh state. The behavior policy selects the dashed and solid actions with probabilities $6/7$ and $1/7$, so that the next-state distribution under it is uniform (the same for all nonterminal states), which is also the starting distribution for each episode. The target policy always takes the solid action, and so the on-policy distribution (for ) is concentrated in the seventh state. The reward is zero on all transitions. The discount rate is . The feature and are defined as follows,
[TABLE]
[TABLE]
References
- 1 [Tamar, Di Castro, and Mannor 2016] Tamar, A.; Di Castro, D.; and Mannor, S. 2016. Learning the variance of the reward-to-go. The Journal of Machine Learning Research 17(13):1–36.
- 2 [Adam and White 2016] Adam, A., and White, M. 2016. Investigating practical linear temporal difference learning. In International Conference on Autonomous Agents & Multiagent Systems, 494–502.
- 3 [Baird 1995] Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995. Elsevier. 30–37.
- 4 [Balamurugan and Bach 2016] Balamurugan, P., and Bach, F. 2016. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, 1416–1424.
- 5 [Bertsekas 2009] Bertsekas, D. P. 2009. Convex Optimization Theory. Athena Scientific, Belmont.
- 6 [Bertsekas 2012] Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA.
- 7 [Dalal et al. 2018a] Dalal, G.; Szorenyi, B.; Thoppe, G.; and Mannor, S. 2018a. Finite sample analyses for TD(0) with function approximation. In AAAI 2018.
- 8 [Dalal et al. 2018b] Dalal, G.; Szorenyi, B.; Thoppe, G.; and Mannor, S. 2018b. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Annual Conference on Learning Theory (COLT).
