Gauss-Newton Temporal Difference Learning with Nonlinear Function Approximation
Zhifa Ke, Junyu Zhang, Zaiwen Wen

Abstract
In this paper, a Gauss-Newton Temporal Difference (GNTD) learning method is proposed to solve the Q-learning problem with nonlinear function approximation. In each iteration, our method takes one Gauss-Newton (GN) step to optimize a variant of Mean-Squared Bellman Error (MSBE), where target networks are adopted to avoid double sampling. Inexact GN steps are analyzed so that one can safely and efficiently compute the GN updates by cheap matrix iterations. Under mild conditions, non-asymptotic finite-sample convergence to the globally optimal Q function is derived for various nonlinear function approximations. In particular, for neural network parameterization with ReLU activation, GNTD achieves an improved sample complexity of , as opposed to the sample complexity of the existing neural TD methods. An…
| Over-parameterized Linear | Neural | Smooth | |
| Total iterations in population update | |||
| TD | – – | ||
| GNTD | |||
| Sample complexity in stochastic update | |||
| TD | – – | ||
| GNTD | |||
| Data set | BC | BCQ | CQL | TD3+BC | TD3+BC (Ours) | GNTD3+BC (Ours) |
| Halfcheetah-medium | 42.4 ± 0.2 | 47.2 ± 0.4 | 37.2 | 42.8 | 46.5 ± 17.6 | 56.7 ± 0.3 |
| Hopper-medium | 30.1 ± 0.3 | 34.0 ± 3.8 | 44.2 | 99.5 | 100.2 ± 0.2 | 100.5 ± 0.3 |
| Walker2d-medium | 12.6 ± 3.1 | 53.3 ± 9.1 | 57.5 | 79.7 | 79.4 ± 1.6 | 81.1 ± 1.3 |
| Halfcheetah-medium-replay | 34.5 ± 0.8 | 33.0 ± 1.7 | 41.9 | 43.3 | 42.2 ± 0.8 | 42.6 ± 0.4 |
| Hopper-medium-replay | 20.0 ± 3.1 | 28.6 ± 1.1 | 28.6 | 31.4 | 31.8 ± 2.0 | 32.4 ± 1.3 |
| Walker2d-medium-replay | 8.1 ± 1.3 | 11.5 ± 1.3 | 15.8 | 25.2 | 23.9 ± 1.6 | 24.8 ± 2.1 |
| Halfcheetah-medium-expert | 70.6 ± 7.1 | 84.4 ± 4.5 | 27.1 | 97.9 | 90.9 ± 3.6 | 98.2 ± 3.3 |
| Hopper-medium-expert | 92.5 ± 15.1 | 111.4 ± 1.2 | 111.4 | 112.2 | 111.9 ± 0.4 | 112.0 ± 0.1 |
| Walker2d-medium-expert | 12.2 ± 3.3 | 50.7 ± 7.2 | 68.1 | 101.1 | 94.6 ± 15.5 | 104.5 ± 4.4 |
| Halfcheetah-expert | 104.9 ± 1.4 | 96.6 ± 2.8 | 82.4 | 105.7 | 103.9 ± 1.7 | 107.6 ± 0.6 |
| Hopper-expert | 111.3 ± 0.9 | 108.7 ± 5.1 | 111.2 | 112.2 | 112.3 ± 0.1 | 112.2 ± 0.5 |
| Walker2d-expert | 58.1 ± 9.1 | 92.6 ± 5.0 | 103.8 | 105.7 | 105.2 ± 1.8 | 107.6 ± 1.5 |
| Total | 597.3 ± 37.7 | 752.0 ± 43.2 | 728.9 | 956.7 | 942.8 ± 46.9 | 980.2 ± 16.1 |
| Data set | TD | DQN | GNTD | GNDQN |
| CartPole-rep | 757.93 | 0.6 | 3.03 | 0.51 |
| CartPole-med-rep | 97.78 | 0.66 | 1.51 | 0.58 |
| Acrobot-rep | 1.32 | 0.63 | 0.71 | 0.52 |
| Acrobot-med-rep | 1.41 | 0.68 | 0.68 | 0.58 |
Provably Efficient Gauss-Newton Temporal Difference Learning Method
with Function Approximation
Zhifa Ke
Zaiwen Wen
Junyu Zhang
Abstract
In this paper, based on the spirit of Fitted Q-Iteration (FQI), we propose a Gauss-Newton Temporal Difference (GNTD) method to solve the Q-value estimation problem with function approximation. In each iteration, unlike the original FQI that solves a nonlinear least square subproblem to fit the Q-iteration, the GNTD method can be viewed as an inexact FQI that takes only one Gauss-Newton step to optimize this subproblem, which is much cheaper in computation. Compared to the popular Temporal Difference (TD) learning, which can be viewed as taking a single gradient descent step to FQI’s subproblem per iteration, the Gauss-Newton step of GNTD better retains the structure of FQI and hence leads to better convergence. In our work, we derive the finite-sample non-asymptotic convergence of GNTD under linear, neural network, and general smooth function approximations. In particular, recent works on neural TD only guarantee a suboptimal sample complexity, while GNTD obtains an improved complexity of . Finally, we validate our method via extensive experiments in both online and offline RL problems. Our method exhibits both higher rewards and faster convergence than TD-type methods, including DQN.
Machine Learning, ICML
1 Introduction
In this paper, we consider the policy evaluation problem, namely, the problem of evaluating the state-action value function (Q function). This is a fundamental building block of many popular Reinforcement Learning (RL) algorithms, including policy improvement methods (Sutton et al., 1999), trust region policy optimization (Schulman et al., 2015) and the actor-critic algorithms (Konda & Tsitsiklis, 1999; Lillicrap et al., 2015; Fujimoto et al., 2018). A properly evaluated Q function often greatly boosts the performance of these methods. For modern RL with enormous state and action spaces, appropriately parameterizing the Q function with a suitable function approximation is crucial to the scalability of RL algorithms. Common examples include linear (Bhandari et al., 2018; Zou et al., 2019; Srikant & Ying, 2019) and neural (Cai et al., 2019; Xu & Gu, 2020; Agazzi & Lu, 2022) function approximations.
Let be the unknown state action value function to be estimated under policy and let be the corresponding Bellman operator. Then the standard formulation for policy evaluation with function approximation is to minimize the Mean-Squared Bellman Error (MSBE):
[TABLE]
where is the stationary distribution of the state-action pairs under the policy . A direct optimization of MSBE can be very hard, as the double sampling issue precludes an unbiased estimator of . A practical technique to address the double sampling issue is to modify the loss function by introducing a target parameter as follows
[TABLE]
One popular method for solving (1) is the Temporal Difference (TD) (Sutton, 1988) learning algorithm and its variants (Bradtke & Barto, 1996; Sutton et al., 2009a, b; Tu & Recht, 2018). Given the surrogate loss and the target parameter , the vanilla TD step updates by
[TABLE]
where is an unbiased estimator of . Since only contains a part of , TD is also termed a semi-gradient method. Though TD-type algorithms have been empirically successful in many numerical applications (Mnih et al., 2013; Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018), the semi-gradient nature of the TD-type methods has mostly limited their theoretical convergence guarantees to linear function approximations (Sutton et al., 2009a, b). Recent results on neural TD (Cai et al., 2019; Xu & Gu, 2020) typically require a suboptimal sample complexity to find some s.t. , where is the optimal solution. For general smooth function approximation, except for a few variants (Maei et al., 2009) where only asymptotic convergence is obtained, TD-type methods have divergence issues both in theory and in practice (Tsitsiklis & Van Roy, 1996; Maei et al., 2009; Achiam et al., 2019; Brandfonbrener & Bruna, 2019).
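As an illustration of the semi-gradient idea, the following is a minimal sketch of one TD step for a linear Q-function. It assumes the surrogate loss is half the squared TD error with the target parameter held fixed; the function names, the feature vectors, and the single-transition setup are hypothetical and chosen only for readability.

```python
def q_value(theta, phi):
    """Linear Q-function: Q(theta; s, a) = <theta, phi(s, a)>."""
    return sum(t * p for t, p in zip(theta, phi))


def td_step(theta, theta_targ, transition, gamma, lr):
    """One semi-gradient TD step on a single transition.

    transition = (phi_sa, reward, phi_next), with hypothetical feature vectors.
    The target r + gamma * Q(theta_targ; s', a') is treated as a constant,
    so only the Q(theta; s, a) term is differentiated.
    """
    phi_sa, reward, phi_next = transition
    target = reward + gamma * q_value(theta_targ, phi_next)
    td_error = q_value(theta, phi_sa) - target
    # Gradient of 0.5 * td_error**2 w.r.t. theta with the target frozen.
    semi_grad = [td_error * p for p in phi_sa]
    return [t - lr * g for t, g in zip(theta, semi_grad)]


theta = [0.0, 0.0]
transition = ([1.0, 0.0], 1.0, [0.0, 1.0])
theta_new = td_step(theta, theta, transition, gamma=0.9, lr=0.5)
```

Because the dependence of the target on the parameter is ignored, the direction used here is only part of the gradient of the squared Bellman error, which is why the step is called a semi-gradient.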
Another popular approach for policy evaluation is the Fitted Q-Iteration (FQI) method (Riedmiller, 2005; Chen & Jiang, 2019; Fan et al., 2020), which repeatedly solves a nonlinear least square subproblem to fit the Q-iteration:
[TABLE]
by sufficiently many stochastic gradient steps. Under a sufficiently expressive function approximation class, such as over-parameterized neural networks, FQI can closely approximate the contractive Q-iteration and hence retain desirable convergence properties. However, the need to solve the subproblem leads to an expensive per-iteration computation, which makes FQI less competitive against TD in practice.
Interestingly, each iteration of TD method can be viewed as an inexact FQI that takes only one stochastic gradient descent step to optimize the nonlinear least square subproblem. The inexactness here can be an interpretation of the divergence behavior of TD learning for general function approximations. This motivates us to design the Gauss-Newton Temporal Difference (GNTD) learning algorithm, which takes one Gauss-Newton step to optimize FQI’s subproblem. The proposed method lies exactly in between FQI and TD, which has cheaper per-iteration computation compared to FQI, and better theoretical convergence compared to TD. In this paper, we provide a complete finite-time convergence analysis of GNTD under linear functions, neural network functions and general smooth functions, for both population and random sampling cases. The detailed sample complexities of GNTD are summarized in Table 1.
1.1 Contributions
Our contributions are summarized as follows.
- We propose the Gauss-Newton Temporal Difference (GNTD) learning algorithm as an inexact FQI method with Gauss-Newton steps. We also design a practically efficient implementation of GNTD based on damping and the K-FAC method.
- We derive convergence and sample complexity analysis of GNTD method with both population and stochastic updates, as summarized in Table 1. Compared to the existing results of TD method, GNTD achieves better sample complexities under linear, neural network and general smooth function approximations. In particular, for over-parameterized neural network approximation, GNTD achieves an improved sample complexity as opposed to the complexity of neural TD.
We also conduct extensive experiments to numerically validate the efficiency of GNTD. For both continuous and discrete tasks, our method outperforms TD and its variants, including DQN. Besides the online setting that is theoretically analyzed in this paper, interestingly, GNTD also exhibits advantageous performance in offline RL tasks compared to several state-of-the-art benchmarks.
1.2 Related Work
TD learning was first proposed for policy evaluation and Q-learning (Sutton, 1988), and numerous variants were developed later, including Gradient TD (Sutton et al., 2009a, b), Least-squares TD (Bradtke & Barto, 1996; Boyan, 2002; Ghavamzadeh et al., 2010; Tu & Recht, 2018), and DQN (Mnih et al., 2013). However, the semi-gradient nature makes the convergence of TD-type methods highly non-trivial. We list the convergence and complexity results of TD-type methods that are most relevant to our work.
Asymptotic Analysis. There are extensive results on the asymptotic convergence analysis of linear TD (Jaakkola et al., 1993; Tsitsiklis & Van Roy, 1996; Perkins & Pendrith, 2002; Borkar, 2009). However, analyzing nonlinear TD remains challenging. In fact, nonlinear TD often diverges in practice (Tsitsiklis & Van Roy, 1996; Maei et al., 2009; Achiam et al., 2019; Brandfonbrener & Bruna, 2019).
Finite-time Analysis. The non-asymptotic sample complexities of linear TD were recently analyzed in (Bhandari et al., 2018; Dalal et al., 2018a; Zou et al., 2019). Regarding its variants (Gradient TD and Least-squares TD), finite-time analyses are also established in (Dalal et al., 2018b; Touati et al., 2018; Liu et al., 2020) and (Lazaric et al., 2010; Prashanth et al., 2014; Tagorti & Scherrer, 2015), respectively. However, such a reformulation leads to bi-level optimization, which is difficult to extend to nonlinear Q-learning and lacks stability in practice (Pfau & Vinyals, 2016). For nonlinear neural TD methods (Cai et al., 2019; Brandfonbrener & Bruna, 2019; Xu & Gu, 2020), the key observation is that wide over-parameterized neural networks are approximately linear under the Neural Tangent Kernel (NTK) regime (Du et al., 2018; Zhang et al., 2019; Allen-Zhu et al., 2019). However, existing neural TD methods only achieve sub-optimal sample complexities.
Besides TD learning, a recent important development for policy evaluation with function approximation is the FQI method (Riedmiller, 2005; Chen & Jiang, 2019; Fan et al., 2020). As long as the function class is expressive enough, and the nonlinear least square fits the Q-iteration well, FQI will always converge to desirable solutions.
2 Gauss-Newton Temporal Difference Learning
2.1 Preliminaries
We consider the infinite-horizon discounted Markov decision process (MDP) , with state space , action space , reward function , transition probability , and a discount factor . Let policy be a mapping that returns a probability distribution over the action space , for any state . Then the state-action value function (Q-function) under policy is
[TABLE]
For any mapping , we denote as the Bellman operator:
[TABLE]
where
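For concreteness, the Bellman operator can be sketched on a small tabular MDP. The snippet assumes the standard form (T^π Q)(s, a) = r(s, a) + γ E[Q(s', a')] with s' drawn from the transition kernel and a' from the policy; the two-state MDP, its numbers, and all function names are hypothetical.

```python
def bellman_operator(q, rewards, transitions, policy, gamma):
    """Apply (T^pi Q)(s, a) = r(s, a) + gamma * E_{s'~P, a'~pi}[Q(s', a')].

    q, rewards:  dict (s, a) -> float
    transitions: dict (s, a) -> {s': probability}
    policy:      dict s -> {a: probability}
    """
    out = {}
    for (s, a), r in rewards.items():
        expected_next = 0.0
        for s_next, p in transitions[(s, a)].items():
            v_next = sum(pi_a * q[(s_next, a_next)]
                         for a_next, pi_a in policy[s_next].items())
            expected_next += p * v_next
        out[(s, a)] = r + gamma * expected_next
    return out


def sup_dist(q1, q2):
    """Sup-norm distance between two tabular Q-functions."""
    return max(abs(q1[k] - q2[k]) for k in q1)


# Hypothetical two-state, two-action MDP.
rewards = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 2.0}
transitions = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {1: 1.0},
               (1, 0): {0: 1.0}, (1, 1): {0: 0.5, 1: 0.5}}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}
gamma = 0.9

q_a = {k: 0.0 for k in rewards}
q_b = {k: float(i) for i, k in enumerate(rewards)}
t_a = bellman_operator(q_a, rewards, transitions, policy, gamma)
t_b = bellman_operator(q_b, rewards, transitions, policy, gamma)
# The operator shrinks sup-norm distances by at least a factor gamma.
contraction_ratio = sup_dist(t_a, t_b) / sup_dist(q_a, q_b)
```

The ratio computed at the end illustrates the contraction property of the operator that the convergence arguments rely on.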
2.2 The GNTD Method
Recall the nonlinear least square subproblem of FQI:
[TABLE]
Unlike FQI that solves this problem with sufficiently many SGD steps, and unlike TD that optimizes the loss with merely one SGD step, we would like to propose a method lying between FQI and TD methods. Denote as the Jacobian matrix of the parameterized Q-function . Then the GNTD method linearizes the in the FQI subproblem and updates the iterates by
[TABLE]
The intuition behind the GNTD update is very clear. If spanned by the columns of the Jacobian is expressive enough s.t. Under proper conditions, informally, one also has
[TABLE]
Combining the above two inequalities yields
[TABLE]
Then the contraction of Bellman operator further yields $\|Q(\theta^{K})-Q^{\pi}\|_{\infty}\leq(1-(1-\gamma)\beta)^{K}\|Q(\theta^{0})-Q^{\pi}\|_{\infty}+\mathcal{O}\big(\frac{\beta}{1-\gamma}\big)$. Therefore, setting or adopting a diminishing sequence of will provide the convergence of GNTD in this ideal situation.
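The contraction factor 1-(1-γ)β in this bound can be checked numerically in an idealized setting. The sketch below assumes that the linearized fit is exact, so that the update acts in function space as the damped Q-iteration Q ← (1-β)Q + βTQ and the extra error term is zero; this reduction is a reading of the informal argument above rather than a statement taken from it. The three state-action pairs, the matrix `P`, and the rewards are hypothetical.

```python
def apply_T(q, r, P, gamma):
    """Bellman operator in matrix form: T Q = r + gamma * P Q, where P is the
    row-stochastic transition matrix over state-action pairs under the policy."""
    n = len(q)
    return [r[i] + gamma * sum(P[i][j] * q[j] for j in range(n)) for i in range(n)]


def sup_norm(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


r = [1.0, 0.0, 2.0]
P = [[0.5, 0.5, 0.0],
     [0.0, 0.2, 0.8],
     [0.3, 0.3, 0.4]]
gamma, beta = 0.9, 0.5

# Reference fixed point Q^pi, obtained by running plain Q-iteration for a long time.
q_pi = [0.0, 0.0, 0.0]
for _ in range(2000):
    q_pi = apply_T(q_pi, r, P, gamma)

# Damped Q-iteration: Q <- (1 - beta) Q + beta * T Q.
q = [0.0, 0.0, 0.0]
rate = 1.0 - (1.0 - gamma) * beta
errors = [sup_norm(q, q_pi)]
for _ in range(50):
    tq = apply_T(q, r, P, gamma)
    q = [(1.0 - beta) * qi + beta * ti for qi, ti in zip(q, tq)]
    errors.append(sup_norm(q, q_pi))
```

Each entry of `errors` is at most `rate` times the previous one, which is the per-step version of the geometric term in the bound.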
For each tuple where is the distribution where , , , , we define the loss function
[TABLE]
where is the TD error w.r.t. the tuple . By removing constant terms, the GNTD subproblem for in (3) can be rewritten in a more sample-friendly form:
[TABLE]
Define the curvature matrix and the semi-gradient as
[TABLE]
Then (4) has a closed form solution . Note that the solution is the Gauss-Newton direction for solving the nonlinear system (Wright et al., 1999), and the semi-gradient is exactly the expected TD direction. Hence we call our method the Gauss-Newton Temporal Difference (GNTD) method.
Note that the population update (3) is not practically implementable. We introduce an empirical version of (4), with an additional quadratic damping term to improve robustness:
[TABLE]
where is a batch of samples from . Then (6) leads to the empirical GNTD update:
[TABLE]
as detailed in Algorithm 1. For large-scale function approximation class such as over-parameterized neural networks, the matrix inversion in (7) can be expensive. In this case, we provide a practically efficient implementation of GNTD using the Kronecker-factored Approximate Curvature (K-FAC) method. Please see the details in the Appendix.
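A minimal sketch of one damped empirical GNTD step is given below for a linear Q-function, where the Jacobian row of each sample is simply its feature vector. It assumes the curvature matrix is the batch average of outer products of Q-gradients, the semi-gradient is the batch average of TD error times Q-gradient, and the update is θ ← θ - β(Ĥ + ωI)⁻¹ĝ; these forms are reconstructed from the description in this section and are not copied from equations (5)-(7). All names and the two-sample batch are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gntd_step(theta, theta_targ, batch, gamma, beta, omega):
    """One damped Gauss-Newton TD step; batch holds (phi_sa, reward, phi_next)."""
    d = len(theta)
    H = [[0.0] * d for _ in range(d)]   # empirical curvature matrix
    g = [0.0] * d                       # empirical semi-gradient (TD direction)
    for phi, reward, phi_next in batch:
        delta = dot(theta, phi) - (reward + gamma * dot(theta_targ, phi_next))
        for i in range(d):
            g[i] += delta * phi[i] / len(batch)
            for j in range(d):
                H[i][j] += phi[i] * phi[j] / len(batch)
    for i in range(d):
        H[i][i] += omega                # quadratic damping term
    direction = solve(H, g)
    return [t - beta * di for t, di in zip(theta, direction)]


batch = [([1.0, 0.0], 1.0, [0.0, 1.0]),
         ([0.0, 1.0], 2.0, [1.0, 0.0])]
theta0 = [0.0, 0.0]
theta_undamped = gntd_step(theta0, theta0, batch, gamma=0.9, beta=1.0, omega=0.0)
theta_damped = gntd_step(theta0, theta0, batch, gamma=0.9, beta=1.0, omega=0.5)
```

In this toy batch the undamped step with β = 1 fits both regression targets exactly, which is the FQI solution for a linear model, while the damped step moves only part of the way. Replacing `solve(H, g)` with `g` itself would give back the plain TD direction.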
3 Convergence Analysis
In this section, we provide a finite-time convergence analysis of GNTD method under linear, neural network, and general smooth function approximations for both population and stochastic updates. The detailed proof can be found in Appendices C and D.
Assumption 3.1.
The data distribution is the stationary distribution over the state-action pairs under the policy .
We write as an -dimensional diagonal matrix, whose -th diagonal element is . Denote , and . Then Assumption 3.1 indicates that is a contraction w.r.t. (Tsitsiklis & Van Roy, 1996), that is,
[TABLE]
For the ease of the notation, we will denote as in later discussion. Instead of an matrix, we view as an column vector, with being a multi-index arranged in the lexicographical order.
3.1 Iteration Complexity for Population Update
3.1.1 Linear Approximation
Consider the linear function approximation with :
[TABLE]
Let be the collection of all feature vectors, whose -th row equals . Without loss of generality, we assume for any . Then the minimizer of the MSBE (1) satisfies the projected Bellman equation
[TABLE]
where is the projection to the subspace under the inner product . We denote as the optimal linear approximator.
In the under-parameterized regime where , we make the following assumption for the feature covariance matrix, in accordance with (Bhandari et al., 2018).
Assumption 3.2.
Define the feature covariance matrix as . We assume and denote as its minimum eigenvalue.
Then the next theorem provides the convergence rate of linear GNTD in the under-parameterized regime, which matches the standard rate of linear TD under the same assumption (Bhandari et al., 2018).
Theorem 3.3.
Suppose Assumptions 3.1 and 3.2 hold. If we set in the population GNTD update (3), then
[TABLE]
If the true Q-function is realizable, namely,
[TABLE]
then the intrinsic error term in Theorem 3.3 vanishes and the estimated Q-functions converge linearly to the true Q-function .
Next, as a warm-up for over-parameterized neural network function approximation, we will analyze the linear GNTD in the over-parameterized regime where . In this case, the Q-function is always realizable and (8) holds true. However, we should also notice that Assumption 3.2 will no longer hold, since the feature covariance matrix is at most rank while the matrix dimension can be much larger. As a result, the standard convergence rate of population linear TD decays to sublinear convergence (Bhandari et al., 2018). In contrast, the population version of GNTD will still be able to retain a linear convergence rate under a very mild condition.
Recall that is a diagonal matrix. Since is no longer positive definite, further discussion is needed. Let , and we introduce the following assumption.
Assumption 3.4.
The -weighted Gram matrix and we denote as its minimum eigenvalue.
Let be the support of the distribution . Then Assumption 3.4 is equivalent to requiring that the feature vectors are linearly independent and the support covers the full . In general, is a very strong assumption. However, it is not necessary for our analysis. For a policy , under mild condition that the state transition Markov chain is aperiodic and irreducible, then iff. . Then the Bellman equation is closed on since it does not involve for :
[TABLE]
This allows us to only care about the MSBE on :
[TABLE]
where . Therefore, Assumption 3.4 can actually be relaxed to requiring the -weighted Gram matrix on to be positive definite. Though we will have no guarantee for for in this case, it is not an issue. For example, when policy evaluation is applied as a built-in module of actor-critic methods, the Q-value outside will never appear in the policy gradient formula. Therefore, throughout this paper, we will assume for the ease of discussion.
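The contrast between the rank-deficient feature covariance matrix and the positive definite Gram matrix in the over-parameterized regime can be seen on a tiny example. The sketch assumes the covariance matrix is ΦᵀDΦ and the weighted Gram matrix is D^{1/2}ΦΦᵀD^{1/2}, with D the diagonal matrix of the data distribution; the exact definition of the Gram matrix is not recoverable from the text here, so this form is an assumption. The feature matrix with two state-action pairs and three parameters is hypothetical.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


# Two state-action pairs (rows), three parameters (columns): over-parameterized.
Phi = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
mu = [0.25, 0.75]                       # data distribution on the two pairs
D_half = [[math.sqrt(mu[0]), 0.0],
          [0.0, math.sqrt(mu[1])]]

W = matmul(D_half, Phi)                 # D^{1/2} Phi
cov = matmul(transpose(W), W)           # 3 x 3, Phi^T D Phi, rank at most 2
gram = matmul(W, transpose(W))          # 2 x 2, D^{1/2} Phi Phi^T D^{1/2}

cov_det = det3(cov)                     # zero up to rounding: singular
gram_det = det2(gram)                   # strictly positive: invertible
```

The covariance matrix cannot be inverted here, while the Gram matrix can, which is the structural property the over-parameterized analysis exploits.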
Theorem 3.5.
Consider the over-parameterized linear GNTD under Assumptions 3.1 and 3.4. If we set in the population update (3), then
[TABLE]
Due to the absence of a positive definite feature covariance matrix in the over-parameterized regime, existing results of linear TD (Bhandari et al., 2018) require $\mathcal{O}\big(\frac{1}{(1-\gamma)^{2}\varepsilon^{2}}\big)$ iterations to guarantee , while GNTD manages to obtain a linear convergence that only requires iterations for finding -accurate solutions.
3.1.2 Neural Network Approximation
Compared to the over-parameterized linear approximation, a more natural way to parameterize the Q-function is the neural network. Consider a two-layer neural network:
[TABLE]
where is the feature mapping, are the weight matrices, and is the ReLU activation. Similar to linear function approximation, we assume for any . We denote the value of in iteration . For each , we initialize the weights and . The parameter will not be trained during the optimization. According to (Du et al., 2018), we make the following assumption to ensure the positive definiteness of the -weighted Gram matrix.
Assumption 3.6.
For all pairs , we assume . Moreover, we assume to simplify the discussion.
Similar to Assumption 3.4, the requirement that the support covers the whole can be relaxed by only considering the MSBE on . Other settings in Assumption 3.6 imply independence and boundedness between two feature vectors. The next theorem establishes the global convergence rate of neural GNTD when it follows population update.
Theorem 3.7.
Suppose Assumptions 3.1 and 3.6 hold. If we set in the population update (3), and the network width , , then w.p. , we have
[TABLE]
For over-parameterized neural networks, the feature covariance matrix will always be rank-deficient while the Gram matrix can be positive definite under proper initialization. Effectively exploiting this property significantly distinguishes the analysis of GNTD from the existing TD-type methods with neural approximation. As a result, to find an -accurate Q-function approximator, GNTD needs iterations, while iterations are required by neural TD (Cai et al., 2019).
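The two-layer ReLU parameterization discussed in this subsection can be sketched as follows. Since the network formula and the initialization are not recoverable from the text, the sketch assumes the form that is common in the cited NTK-style analyses: Q(W; x) = (1/√m) Σ_r b_r ReLU(w_rᵀx), with Gaussian first-layer weights that are trained and random signs b_r that stay fixed. The width, input, and function names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def init_network(m, d, seed=0):
    """Assumed initialization: w_r ~ N(0, I), b_r uniform on {-1, +1} (b is fixed)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    b = [rng.choice([-1.0, 1.0]) for _ in range(m)]
    return W, b


def q_value(W, b, x):
    """Q(W; x) = (1 / sqrt(m)) * sum_r b_r * relu(<w_r, x>)."""
    m = len(W)
    return sum(b_r * max(0.0, dot(w_r, x)) for w_r, b_r in zip(W, b)) / math.sqrt(m)


def jacobian(W, b, x):
    """Gradient of Q w.r.t. the trained weights W, flattened row by row."""
    m = len(W)
    grad = []
    for w_r, b_r in zip(W, b):
        active = 1.0 if dot(w_r, x) > 0.0 else 0.0   # ReLU derivative
        grad.extend(b_r * active * x_i / math.sqrt(m) for x_i in x)
    return grad


W, b = init_network(m=8, d=3, seed=0)
x = [0.3, -0.2, 0.5]                      # hypothetical feature vector of one (s, a)
q = q_value(W, b, x)
J = jacobian(W, b, x)
flat_W = [w for row in W for w in row]
# ReLU is positively homogeneous, so Q(W; x) = <J(W; x), vec(W)> holds exactly.
euler_gap = abs(dot(J, flat_W) - q)
```

Stacking one such Jacobian row per state-action pair gives the matrix whose Gram matrix appears in the assumption above; the homogeneity identity at the end is only a sanity check on the gradient.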
3.1.3 General Smooth Function Approximation
In this section, we consider the smooth function approximation that satisfies the following properties.
Assumption 3.8.
For all , the function is uniformly bounded by , -Lipschitz, and -smooth:
[TABLE]
[TABLE]
Analogous to Assumption 3.4, we make the following requirement for the Jacobian matrix .
Assumption 3.9.
such that for any , we have , where is diagonal.
When is linear, Assumption 3.9 reduces to Assumption 3.4. It actually implies that the objective function (cf. (4)) of the subproblem of the population update (3) is -strongly convex. Let us define the worst-case optimal fitting error of over the parameter space of as
[TABLE]
then we have the following theorem.
Theorem 3.10.
Suppose Assumptions 3.1, 3.8, and 3.9 hold. Then for population update (3), there exists a constant independent of and such that
[TABLE]
If we set , then after $K=\mathcal{O}\Big(\frac{1}{(1-\gamma)^{2}\varepsilon}\log\frac{1}{\varepsilon}\Big)$ iterations.
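The order of this iteration count can be recovered from the informal bound of Section 2.2. The derivation below is a sketch under two assumptions that are not stated in the text: the step size is taken proportional to (1-γ)ε with a hypothetical constant c, and the fitting-error term is ignored.

```latex
% Assumed starting point (informal bound of Section 2.2):
%   \|Q(\theta^{K})-Q^{\pi}\|_{\infty}
%     \le (1-(1-\gamma)\beta)^{K}\,\|Q(\theta^{0})-Q^{\pi}\|_{\infty}
%         + \mathcal{O}\big(\tfrac{\beta}{1-\gamma}\big).
\begin{align*}
\text{Step 1 (bias term): }\;
  & \beta = c\,(1-\gamma)\,\varepsilon
    \;\Longrightarrow\;
    \mathcal{O}\Big(\frac{\beta}{1-\gamma}\Big) = \mathcal{O}(\varepsilon), \\
\text{Step 2 (geometric term): }\;
  & (1-(1-\gamma)\beta)^{K} \le \exp\big(-(1-\gamma)\beta K\big) \le \varepsilon
    \;\Longleftarrow\;
    K \ge \frac{1}{(1-\gamma)\beta}\log\frac{1}{\varepsilon}, \\
\text{Step 3 (combine): }\;
  & K = \frac{1}{c\,(1-\gamma)^{2}\varepsilon}\log\frac{1}{\varepsilon}
      = \mathcal{O}\Big(\frac{1}{(1-\gamma)^{2}\varepsilon}\log\frac{1}{\varepsilon}\Big).
\end{align*}
```

The extra factor 1/ε compared with a purely linear rate comes from having to shrink the step size in order to control the linearization error.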
3.2 Sample Complexity for Stochastic Update
In this section, we analyze the convergence and complexity result of the more practical stochastic GNTD method (7), under linear, neural network, and general smooth function approximations. In this scheme, each iteration is constructed by sampling a batch of data tuples from distribution .
3.2.1 Linear Approximation
We start with the under-parameterized linear approximation case, for which the convergence and complexity is well-studied for TD method under Assumption 3.2. In this situation, the main technique for establishing a finite sample convergence for stochastic GNTD is to exploit the concentration inequality and analyze the proposed method as the mean path population update with controllable errors. In particular, to facilitate the application of concentration inequality, we will require the iteration sequence to be bounded. In fact, Assumption 3.2 indicates that
[TABLE]
Therefore, by exploiting the fast convergence of GNTD we can inductively provide a uniform bound for with high probability while simultaneously establishing the convergence of . As a result, we derive the following theorem for stochastic GNTD.
Theorem 3.11.
Suppose Assumptions 3.1 and 3.2 hold. For any , if we set , the damping rate and the sample size for each iteration, then Algorithm 1 satisfies
[TABLE]
w.p. , where is a constant.
Suppose the Q-function is realizable, then setting in Theorem 3.11 yields . Such a complexity matches the result of stochastic linear TD (Bhandari et al., 2018).
Next, we consider the over-parameterized case. In this scenario, we still need a uniform bound of the iteration sequence to facilitate the concentration inequality. However, it is not straightforward due to the lack of (11), which is based on Assumption 3.2. Fortunately, on the one hand, Assumption 3.4 enables us to bound with . On the other hand, a fast convergence of in GNTD can further indicate a fast convergence of . Therefore, we can inductively provide a uniform bound for by while proving a fast convergence of .
Theorem 3.12.
Consider the over-parameterized linear GNTD under Assumptions 3.1 and 3.4. For any , if we set and the sample size for each iteration, then the output of Algorithm 1 satisfies
[TABLE]
w.p. , where is a constant.
In the absence of Assumption 3.2, the existing analysis of linear TD (Bhandari et al., 2018) only provides a sub-optimal sample complexity. In contrast, our GNTD method can guarantee an sample complexity by utilizing the over-parameterization structure that allows Assumption 3.4, which yields a significant advantage over the TD method.
3.2.2 Neural Network Approximation
Now let us proceed to the discussion of GNTD on neural network approximation. By utilizing a similar approach to over-parameterized linear GNTD, one can prove a uniform bound on the iteration sequence when is parameterized by the neural network (9).
Theorem 3.13.
Suppose Assumptions 3.1 and 3.6 hold. If we set and the network width for each iteration , then the output of Algorithm 1 satisfies
[TABLE]
w.p. , where is some constant.
By setting and , Theorem 3.13 indicates an sample complexity of neural GNTD, which is much better than the sample complexity of neural TD (Cai et al., 2019).
3.2.3 General Smooth Function Approximation
For smooth functions, the key is to bound the gap between the solutions of (4) and (6). There are many ways to deal with (6), including Empirical Risk Minimization (Shalev-Shwartz et al., 2009) and ProxBoost (Davis et al., 2021). In particular, ProxBoost provides the state-of-the-art sample complexity w.r.t. the failure probability and the problem condition number. Therefore, we will adopt ProxBoost to enhance the solution to the ERM subproblem (6).
Theorem 3.14.
Suppose Assumptions 3.1, 3.8, and 3.9 hold. If we set , then w.p. , the output of Algorithm 1 satisfies
[TABLE]
for some constants .
To the best of our knowledge, there is no explicit finite-time convergence analysis of TD for smooth functions. By choosing the damping rate and the step size , stochastic GNTD can output an -accurate Q-function approximator with iterations and in total samples.
4 Experiments
Finally, we conduct a series of experiments over the OpenAI Gym (Brockman et al., 2016) tasks and demonstrate the efficiency of GNTD method under a variety of settings.
In detail, we first examine the advantage of GNTD over TD in the on-policy reinforcement learning setting, where the policy evaluation serves as a built-in module of the policy iteration method. Second, we also consider a few offline RL tasks, where we extend the proposed method to the Q-learning settings. All the compared learning algorithms are trained without exploration. We compare the performance of different algorithms in terms of the Bellman error and the final return.
4.1 Policy Optimization with GNTD Method
First, we present the experiments where GNTD and TD are executed as built-in modules of an entropy regularized (Haarnoja et al., 2018) policy iteration method. Typically, policy iteration is divided into two steps: policy evaluation and policy improvement. In detail, given an initial policy , our agents collect a data batch and then perform a 25-step policy evaluation to obtain , by either GNTD or TD method. Then, we take a 1-step policy gradient (PG) ascent to the entropy regularized total reward:
[TABLE]
where is the entropy of the density function . Then the agent executes the new policy to collect a new data batch, and loops through policy optimization until convergence.
For the policy function and state-action value function , we employ two-layer neural networks. For computational efficiency, we implement the GNTD algorithm with the K-FAC method (see Appendix A). We set the damping rate and the learning rate of the -function.
Figure 1 shows the experimental results under several on-policy OpenAI Gym environments. PG-GNTD and PG-TD refer to the policy optimization based on GNTD and TD, respectively. It can be observed that PG-GNTD converges faster than PG-TD in the Hopper, Walker2d and Swimmer environments. It also attains higher final rewards in all tasks.
4.2 Offline Reinforcement Learning Tasks
In this section, we present the experiments for both discrete and continuous offline RL tasks, where we will focus on optimizing rather than evaluating the policy. We compare the performance of our method against several benchmarks in terms of the Bellman error and average return.
4.2.1 Discrete Action Tasks
In this experiment, we present the experimental results under the OpenAI Gym CartPole-v1 and Acrobot-v1 environments. The tested algorithms include TD, DQN (Mnih et al., 2013), GNTD, and GNDQN. In particular, both TD and GNTD are adapted to the Q-learning setting where the Bellman operator is replaced with the optimal Bellman operator. GNDQN is a variant of the GNTD method that incorporates a DQN-style momentum update to the target network, while taking Gauss-Newton steps to update the weight matrices. Furthermore, the four algorithms use the same neural network architecture and have the same learning rate of . The size of all offline datasets is chosen as , and we set the damping rate to be .
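The momentum update of the target network can be sketched as an exponential moving average of the online parameters. This reading of "momentum update" is an assumption, and the function name and the mixing weight `tau` are hypothetical.

```python
def soft_update(theta_targ, theta, tau):
    """Move the target parameters a fraction tau toward the online parameters."""
    return [(1.0 - tau) * t_old + tau * t_new
            for t_old, t_new in zip(theta_targ, theta)]


theta_targ = [0.0, 2.0]
theta = [1.0, 0.0]
theta_targ_soft = soft_update(theta_targ, theta, tau=0.25)
theta_targ_hard = soft_update(theta_targ, theta, tau=1.0)   # plain copy of theta
```

With `tau = 1.0` the target simply tracks the online parameters, while a small `tau` makes the regression target change slowly between Gauss-Newton steps.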
Compared to the on-policy setting, offline RL requires strong conditions on the data distribution in order to obtain an optimal or near optimal policy. According to (Fan et al., 2020; Agarwal et al., 2020), we consider the following types of datasets:
- Replay datasets. Train an online policy until convergence and use all samples during training.
- Medium-replay datasets. Train an online sub-optimal policy and use all samples during training.
From Figure 2 and Table 3, it can be observed that GNTD outperforms TD in terms of convergence speed, final reward, and Bellman error. After incorporating the momentum into the target network update, our GNDQN also dominates DQN in all the reported performance measures.
4.2.2 Continuous Tasks
Finally, we examine the performance of GNTD on the OpenAI Gym MuJoCo tasks using D4RL datasets. In these tasks, we propose GNTD3+BC, a variant of our method that merges GNTD with TD3+BC (Fujimoto & Gu, 2021), where a behavior cloning term is added to regularize the policy. See Appendix E for details.
Table 2 shows the final numerical results for GNTD3+BC and other baselines, including BC (behavioral cloning), BCQ (Fujimoto et al., 2019), CQL (Kumar et al., 2020) and TD3+BC (Fujimoto & Gu, 2021). Compared with TD3+BC, GNTD3+BC has higher final returns and lower variance in multiple environmental settings.
Appendix A Kronecker-Factored Approximate Curvature (K-FAC) Method for GNTD
In this section, we introduce the Kronecker-Factored Approximate Curvature (K-FAC) method (Martens & Grosse, 2015), which provides an efficient implementation of neural GNTD.
Although the GNTD update formula (7) in Section 2.2 differs in expression from the natural gradient method (which requires the loss function to be the negative log-likelihood of a normal distribution), the two share a similar functional structure.
For a feed-forward deep neural network (DNN) with layers, we denote the weight matrices as of -th layer () and we denote the ReLU activation function as . For any state-action pair , the output is in general a non-convex function of the weights $\theta=\big[\theta_{1}^{\top},\ldots,\theta_{L}^{\top}\big]^{\top}$. Alternatively, can also be viewed as an parameter matrix that maps -dimensional vectors to -dimensional vectors. We define as a matrix form of the vector parameters related to the number of neurons in a single layer and define as a flattened vector form of the matrix parameters. The following algorithm describes the network's forward and backward pass for a single state-action pair .
From Algorithm 2, with the weights of the neural network being , we let and denote the forward vector and backward vector of the -th layer, respectively, and we define the matrices
[TABLE]
For a training dataset that contains multiple data-points, the K-FAC method attempts to approximate the matrix in (5) by the following block-diagonal matrix
[TABLE]
After incorporating the identity matrix originating from the Levenberg-Marquardt method, we approximately calculate the matrix inversion as follows
[TABLE]
For the stochastic sampling case where the expectations are approximated by the sample averages, we let and be the empirical estimators of and , which are given as
[TABLE]
where and are constructed by running Algorithm 2 for the -th data point and utilizing (12). Similarly, let be an estimator of the semi-gradient , and let be the semi-gradient of the -th layer. Then the descent direction for the -th layer is
[TABLE]
Then we can naturally get the expression for the parameter update:
[TABLE]
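As a minimal sketch of the per-layer computation described above, the following NumPy code builds the two Kronecker factors from sample averages and applies the damped inverse to a layer's semi-gradient without ever forming the Kronecker product. The function names, the array layout, and the choice to add `sqrt(lam)` to each factor are assumptions made for illustration; they are not taken from the displayed formulas.

```python
import numpy as np

def empirical_factors(forward, backward):
    """Sample-average Kronecker factors for one layer.

    forward  : (N, n_in)  forward (input) vectors of the layer, one row per data point
    backward : (N, n_out) backward vectors of the layer, one row per data point
    """
    N = forward.shape[0]
    A = forward.T @ forward / N      # (n_in, n_in)
    S = backward.T @ backward / N    # (n_out, n_out)
    return A, S

def kfac_layer_direction(A, S, G, lam):
    """Return the matrix X with vec(X) = (A_d kron S_d)^{-1} vec(G).

    G   : (n_out, n_in) layer semi-gradient in matrix form
    lam : damping parameter. ASSUMPTION: the damping is split as sqrt(lam)
          on each factor, a common K-FAC heuristic.
    """
    root = np.sqrt(lam)
    A_d = A + root * np.eye(A.shape[0])
    S_d = S + root * np.eye(S.shape[0])
    # For symmetric A_d: (A_d kron S_d)^{-1} vec(G) = vec(S_d^{-1} G A_d^{-1}),
    # with vec() stacking columns.
    G_Ainv = np.linalg.solve(A_d, G.T).T     # G @ inv(A_d)
    return np.linalg.solve(S_d, G_Ainv)      # inv(S_d) @ G @ inv(A_d)
```

The layer update would then subtract a step size times this direction from the layer's weight matrix. The benefit is that only an `n_in x n_in` and an `n_out x n_out` system are solved, rather than one of size `n_in * n_out`.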
Appendix B Extending GNTD to Q-learning Algorithms
As mentioned in Section 4.2.1, we extend the GNTD method to offline Q-learning algorithms, where we consider policy optimization instead of just policy evaluation. Specifically, with being a batch of tuples collected from the distribution and , we define the stochastic estimator of the semi-gradient as
[TABLE]
for any tuple . The TD error term in (16) is induced by the Bellman optimality operator instead of the Bellman operator:
[TABLE]
Here in TD and in DQN (Mnih et al., 2013). Combining with the curvature matrix, we design the GNTD and GNDQN learning algorithms. See Algorithm 3 for more details.
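The difference between the two TD errors can be illustrated with a short NumPy sketch of the batch semi-gradient. The argument names, the finite action set, and the sign convention (current value minus target) are assumptions for illustration only.

```python
import numpy as np

def semi_gradient(q_sa, grad_q_sa, rewards, q_next_all, gamma, next_actions=None):
    """Batch-averaged semi-gradient: mean_i delta_i * grad_theta Q(s_i, a_i).

    q_sa         : (N,)   current values Q(s_i, a_i; theta)
    grad_q_sa    : (N, d) gradients of Q(s_i, a_i; theta) w.r.t. theta
    rewards      : (N,)
    q_next_all   : (N, n_actions) target-network values Q(s'_i, .)
    next_actions : (N,) sampled next actions a'_i (TD, policy evaluation);
                   None selects the greedy action (DQN, Bellman optimality).
    """
    if next_actions is None:
        q_next = q_next_all.max(axis=1)
    else:
        q_next = q_next_all[np.arange(len(rewards)), next_actions]
    # ASSUMED sign convention: TD error = current value minus bootstrapped target.
    delta = q_sa - (rewards + gamma * q_next)
    # The target is treated as a constant, so only Q(s_i, a_i) is differentiated.
    return (delta[:, None] * grad_q_sa).mean(axis=0)
```

Passing `next_actions` gives the policy-evaluation variant, while omitting it gives the greedy (DQN-style) variant; the curvature matrix is the same in both cases.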
Appendix C Analysis of Population Updates in Section 3.1
Throughout the discussion of population update, we will write and in the -th iteration.
C.1 Proof of Theorem 3.3 (Under-parameterized Linear GNTD)
Recall that is a diagonal matrix. For linear parameterization where , the population GNTD update (3) has an explicit formula:
[TABLE]
where
[TABLE]
Let be the optimal linear approximator under the feature matrix , then we introduce two important lemmas for the analysis of under-parameterized linear function approximation.
Lemma C.1 (Bhandari et al., 2018). Under Assumption 3.1, we have that
[TABLE]
Lemma C.2 (Cai et al., 2019). Under Assumption 3.1, we have that
Now we are ready to prove Theorem 3.3.
Proof.
Recall the update formula from (17), then we have
[TABLE]
By Assumption 3.2 and Lemma C.1, we have
[TABLE]
Choosing yields
[TABLE]
Then by Lemma C.2, we have
[TABLE]
Thus we complete the proof. ∎
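To make the under-parameterized linear case concrete, the sketch below runs a population update of the assumed form theta <- theta - beta * (Phi^T D Phi)^{-1} Phi^T D (Q - TQ) on a small synthetic Markov chain and compares the limit with the TD fixed point. The chain, the features, and the function names are hypothetical, and the update form is an assumption about the explicit formula rather than a transcription of it.

```python
import numpy as np

def population_linear_gntd(Phi, P, r, mu, gamma, beta=1.0, iters=300):
    """Iterate theta <- theta - beta * Sigma^{-1} Phi^T D (Q - TQ), with Q = Phi theta."""
    D = np.diag(mu)
    Sigma = Phi.T @ D @ Phi                          # curvature matrix (d x d)
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        q = Phi @ theta
        delta = q - (r + gamma * P @ q)              # population TD error vector
        theta = theta - beta * np.linalg.solve(Sigma, Phi.T @ D @ delta)
    return theta

def td_fixed_point(Phi, P, r, mu, gamma):
    """Solve Phi^T D (Phi - gamma P Phi) theta = Phi^T D r."""
    D = np.diag(mu)
    return np.linalg.solve(Phi.T @ D @ (Phi - gamma * P @ Phi), Phi.T @ D @ r)

# Hypothetical example: 6 state-action pairs, 3 features.
rng = np.random.default_rng(0)
n, d, gamma = 6, 3, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)                    # row-stochastic transition matrix
mu = np.full(n, 1.0 / n)
for _ in range(1000):                                # power iteration for the stationary distribution
    mu = mu @ P
Phi = rng.normal(size=(n, d))
r = rng.random(n)
theta_gn = population_linear_gntd(Phi, P, r, mu, gamma)
theta_star = td_fixed_point(Phi, P, r, mu, gamma)
```

With `beta = 1` each iteration is a projected Bellman step, so the error in the weighted norm shrinks by a factor of at most `gamma` per iteration when `mu` is stationary.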
C.2 Proof of Theorem 3.5 (Over-parameterized Linear GNTD)
By Assumption 3.4, the -weighted Gram matrix is nonsingular, so the least-squares subproblem of the population update (3) has an explicit solution
[TABLE]
where is the population TD error vector. Consequently, we have
[TABLE]
Therefore,
[TABLE]
This completes the proof.
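The over-parameterized case can be checked numerically with a small sketch. Assuming the subproblem is solved by its minimum-norm solution and the feature matrix has full row rank, the parameter step reproduces a damped Bellman iteration in function space, Q <- (1 - beta) Q + beta TQ. The function name and the use of the pseudoinverse are illustrative assumptions.

```python
import numpy as np

def overparam_gntd_step(theta, Phi, P, r, gamma, beta):
    """One population step with a minimum-norm Gauss-Newton direction.

    Phi : (n, d) feature matrix with d >= n and full row rank (assumed).
    """
    q = Phi @ theta
    delta = q - (r + gamma * P @ q)          # population TD error vector
    # Minimum-norm solution of Phi d = -delta. Since the system is consistent,
    # positive diagonal weights on the residual do not change this solution.
    d = -np.linalg.pinv(Phi) @ delta
    return theta + beta * d
```

Because `Phi @ pinv(Phi)` is the identity when `Phi` has full row rank, the function values after the step equal `(1 - beta) * q + beta * (r + gamma * P @ q)`, which is where the contraction of the Bellman operator enters.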
C.3 Proof of Theorem 3.7 (Neural GNTD)
Let be the Jacobian matrix. For the neural network function approximation (9), we can rewrite (3) as
[TABLE]
where we denote . Let , where corresponds to the -th diagonal element of the diagonal matrix . To simplify the notation, let . Then we define the -th element of the -weighted expected Gram matrix as follows:
[TABLE]
where is the indicator function and the expectation is taken w.r.t. the Gaussian initialization of the weights. Additionally, we define the -weighted Gram matrix in the -th iteration as . Let (Du et al., 2018) suggests that can be ensured by setting the network width to be an appropriate polynomial of , , and . However, the constant is only related to the network width , and does not affect the convergence rate of GNTD. Therefore, to simplify the discussion, we omit the constant in subsequent proofs.
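For intuition about the objects in this subsection, the sketch below computes the Jacobian and a weighted Gram matrix for a two-layer ReLU network. The network form (fixed output signs `b`, scaling by `1/sqrt(m)`, only hidden weights trained) and the Gram definition `D^{1/2} J J^T D^{1/2}` are assumptions chosen to match the usual over-parameterized analysis; they are not copied from (9).

```python
import numpy as np

def two_layer_relu(W, b, X):
    """Q(x) = (1/sqrt(m)) * sum_r b_r * relu(w_r . x).  W: (m, d), b: (m,), X: (n, d)."""
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ b / np.sqrt(m)

def jacobian(W, b, X):
    """Jacobian of the n outputs w.r.t. the flattened hidden weights, shape (n, m*d)."""
    m = W.shape[0]
    active = (X @ W.T > 0).astype(float)     # (n, m): indicator 1{w_r . x_i > 0}
    # Row i, block r: b_r * 1{w_r . x_i > 0} * x_i / sqrt(m)
    J = (active * b)[:, :, None] * X[:, None, :] / np.sqrt(m)
    return J.reshape(X.shape[0], -1)

def weighted_gram(J, mu):
    """Weighted Gram matrix D^{1/2} J J^T D^{1/2} with D = diag(mu) (assumed form)."""
    JD = np.sqrt(mu)[:, None] * J
    return JD @ JD.T
```

Since the ReLU network is positively homogeneous in `W`, the identity `jacobian(W, b, X) @ W.reshape(-1) == two_layer_relu(W, b, X)` gives a quick correctness check, and the smallest eigenvalue of the Gram matrix at initialization can be inspected with `np.linalg.eigvalsh`.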
Lemma C.3 (Du et al., 2018). Suppose Assumption 3.6 holds, then . Define . If the network width , then we have w.p. .
Lemma C.4 (Zhang et al., 2019). Suppose Assumption 3.6 holds. For any , denote , then w.p. at least , we have for all satisfying .
The above lemma shows that as long as the parameters are close to the random initialization, the corresponding -weighted Jacobian matrix is also close to the initial -weighted Jacobian matrix . Thus we expect as long as the iteration stays close enough to . As a result, we have the following lemma.
Lemma C.5.
Suppose Assumptions 3.1 and 3.6 hold. If the network width satisfies $m=\Omega\Big(\frac{n^{3}}{\nu^{2}\lambda_{0}^{4}\delta^{2}(1-\gamma)^{2}}\Big)$ and satisfies , then , we have with being a constant, and with being the -weighted Gram matrix at .
Proof.
First, by setting in Lemma C.4, then w.p. we have
[TABLE]
According to the initialization and , we let denote the expectation w.r.t. and for each . Then under Assumption 3.1, we have
[TABLE]
where (i) follows from the fact that and have the same marginal distribution, (ii) follows from the independence among ’s and the fact that , and (iii) follows from the expectation of where is the dimension of the feature mapping. Thus $\frac{6\|Q^{0}-TQ^{0}\|_{\mu}}{(1-\gamma)\sqrt{\lambda_{\min}(G^{0})}}=\mathcal{O}\Big(\frac{\sqrt{1/\delta}}{1-\gamma}\Big)$ w.p. by Markov's inequality. Let $m=\Omega\Big(\frac{n^{3}}{\nu^{2}\lambda_{0}^{4}\delta^{2}(1-\gamma)^{2}}\Big)$, then , where is a constant.
Next, based on the inequality that where denotes singular value, we have
[TABLE]
where the last inequality uses the fact that when the network width is large enough. ∎
Lemma C.6.
Conditioning on the success of the high probability event in Lemma C.5, if , then the following inequalities hold
[TABLE]
and
[TABLE]
where is the constant defined in Lemma C.5.
Proof.
We only need to verify the first inequality. Let and calculate
[TABLE]
Next we estimate the bound on the norm of the second term . Note that
[TABLE]
where (i) is due to and (ii) is due to . Thus we complete the proof. ∎
Now we are ready to provide the proof of Theorem 3.7.
Proof.
Note that the key results of Lemmas C.5 and C.6 all rely on the condition that the analyzed point stays close to . Therefore, to prove this theorem with the above lemmas, we need to prove the following statement by induction:
[TABLE]
Then the final convergence rate result will be automatically covered as a byproduct in the proof of (20). When , (20) is obviously true. Now, suppose (20) holds for all , we prove this argument for .
First, let us denote . Conditioning on the success of the high probability event in Lemma C.5, Lemma C.6 and (20) indicate that (18) and (19) hold for . Consequently, for any , we have
[TABLE]
where . Let be large enough such that , then
[TABLE]
As a result, for , we have
[TABLE]
where (i) is due to Lemma C.6. Thus we have that for .
Consequently, for , we have
[TABLE]
where (i) is due to Lemma C.5. Hence we complete the proof of (20). As a byproduct, we have (22) for all iterations, which further implies the convergence rate result: ∎
Remark C.7.
Note that (Zhang et al., 2019) theoretically verifies the efficient performance of the natural gradient method (or Gauss-Newton method) in deep learning. We find that this technique also works well on the semi-gradient method of policy evaluation. Unlike classification or regression problems, neural GNTD retains the structure of the FQI well, exploiting the contraction property of the Bellman operator to obtain global convergence straightforwardly.
C.4 Proof of Theorem 3.10 (Nonlinear Smooth GNTD)
Proof.
Under Assumptions 3.8 and 3.9, the subproblem in the population update (3) has a closed form solution:
[TABLE]
where is the Jacobian matrix and is the population TD error. Consequently,
[TABLE]
Define the residual term . By Assumption 3.8, we have
[TABLE]
Recall the notation , with defined by (10), we have
[TABLE]
As a result, we have
[TABLE]
where (i) follows that is the contraction operator, , and is bounded as mentioned above. Thus we complete the proof. ∎
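For a general smooth parameterization, the closed-form direction can be sketched as a damped (Levenberg-Marquardt) Gauss-Newton solve. The weighted least-squares form of the subproblem, the damping parameter `lam`, and the function names are assumptions for illustration; setting `lam = 0` gives the undamped direction when the weighted curvature matrix is invertible.

```python
import numpy as np

def damped_gn_direction(J, delta, mu, lam):
    """Minimizer of 0.5 * ||J d + delta||_D^2 + 0.5 * lam * ||d||^2, with D = diag(mu).

    J     : (n, d) Jacobian of the Q values w.r.t. the parameters
    delta : (n,)   population TD error vector
    mu    : (n,)   positive weights
    """
    JtD = J.T * mu                                     # J^T D, shape (d, n)
    H = JtD @ J + lam * np.eye(J.shape[1])             # damped curvature matrix
    return -np.linalg.solve(H, JtD @ delta)

def gntd_update(theta, J, delta, mu, beta, lam):
    """theta <- theta + beta * d, with d the damped Gauss-Newton direction."""
    return theta + beta * damped_gn_direction(J, delta, mu, lam)
```

The returned direction satisfies the first-order condition `J^T D (J d + delta) + lam * d = 0`, which is an easy property to verify numerically.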
Appendix D Analysis of Stochastic Updates in Section 3.2
Before starting the analysis of stochastic GNTD, we introduce some notation. In the -th iteration, we obtain a batch of data tuples of the form ; we call this set of data tuples . For each , we define . Consequently, the semi-gradient estimator defined in (7) can also be written as . Let be the empirical estimator of based on the dataset , then the stochastic estimator defined in (7) can be equivalently written as , where .
To analyze the convergence of stochastic GNTD, as mentioned in Section 3.2, it is necessary to ensure that is bounded for each . This is mainly because the semi-gradient estimator is controlled by . Define
[TABLE]
Later on, we will discuss the high probability uniform upper bound of under different settings, including under-parameterized linear functions, over-parameterized linear functions and neural network functions. As a result, let be the sigma-algebra generated by the randomness until the iteration . Then we have Based on such upper bounds, we introduce the following two lemmas.
Lemma D.1.
Let be a sequence of independent random vectors. Assume and , then
[TABLE]
The proof of this result is straightforward. By , we have Then applying Lemma 4.1 (Lan, 2020) proves this lemma.
Lemma D.2.
Suppose Assumption 3.8 holds true, then for any and any iteration , we have w.p. as long as the sample size .
Proof.
First, recall the definition that
[TABLE]
Define , then it holds that and By the matrix Bernstein inequality (Tropp et al., 2015), we have
[TABLE]
Choosing the batch size to satisfy proves this lemma. ∎
D.1 Proof of Theorem 3.11 (Stochastic Under-parameterized Linear GNTD)
First of all, let us provide a few supporting lemmas. For linear approximation, the stochastic GNTD update (7) can be written as with
[TABLE]
The following lemma provides a one-step progress for stochastic under-parameterized linear GNTD.
Lemma D.3.
Consider linear parameterization . For any , we set , and the sample size in iteration . Under Assumptions 3.1 and 3.2, then
[TABLE]
Proof.
According to the stochastic update of GNTD with linear approximation, we have
[TABLE]
By Assumption 3.2 and Lemma D.2, if the sample size for any , we have
[TABLE]
Let . Then we have that
[TABLE]
and
[TABLE]
where (i) and (ii) are both due to (25) and Assumption 3.2. Next by Lemma D.1, we have
[TABLE]
Thus the second term on the right side of equation (24) can be estimated as
[TABLE]
where (i) follows from (D.1) and (ii) follows from Lemma C.1. The last term on the right side of equation (24) can be decomposed as follows
[TABLE]
Observe that the first term on the right side of equation (29) does not exceed by Lemma C.1. The second term on the right side of equation (29) can be estimated as
[TABLE]
The last term on the right side of equation (29) can be estimated as
[TABLE]
where (i) and (iii) both follow the inequality that , and (ii) follows (D.1). Choosing yields
[TABLE]
Plugging (D.1), (D.1) into (24) yields that
[TABLE]
∎
The above lemma shows that the one-step error depends strongly on the sample size . As long as is sufficiently large, and are both uniformly bounded for each ; see Lemma D.4.
Lemma D.4.
Suppose Assumptions 3.1 and 3.2 hold. We define Set and the sample size for each iteration. Then w.p. , we have
[TABLE]
for any and for all .
Proof.
We will prove this lemma by induction. First of all, for any and any tuple , we have
[TABLE]
where (i) and (ii) are due to the fact that , and (iii) is due to . Substituting into (D.1) proves (31) for .
Now, suppose (31) holds for , then we prove this argument for . For , we choose in Lemma D.3 and the batch size , then we have that
[TABLE]
for . Because , conditioning on the success of the above high probability event, for , we have
[TABLE]
Substituting the above inequality to (D.1) yields
[TABLE]
Hence we have and we have proved (31) for . By induction, (31) holds for w.p. . ∎
Now we are ready to provide the proof of Theorem 3.11. Recall the definition of in Lemma D.4. We restate Theorem 3.11 as follows to include the discussion of the specific parameters.
Theorem D.5.
Suppose Assumptions 3.1 and 3.2 hold and suppose the target accuracy level . If we choose and the sample size for each iteration, where is the dimension of the parameter . Then w.p. , the output of Algorithm 1 satisfies
[TABLE]
Consequently, we have $\|Q^{K}-Q^{\pi}\|_{\mu}\leq\mathcal{O}\big(\varepsilon+\|\Pi_{\mu}Q^{\pi}-Q^{\pi}\|_{\mu}\big)$ with iterations and samples in total.
Proof.
First, by Lemma D.4, we have for w.p. . Then Lemma D.3 indicates that
[TABLE]
Then by Lemma C.2, we complete the proof. ∎
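A single stochastic step for linear features can be sketched as follows. The batch layout, the optional damping `lam`, and the use of the current parameters as the bootstrapping target are assumptions for illustration; both the curvature matrix and the semi-gradient are replaced by batch averages.

```python
import numpy as np

def stochastic_linear_gntd_step(theta, phi, rewards, phi_next, gamma, beta, lam=0.0):
    """One mini-batch GNTD step with Q(s, a; theta) = phi(s, a)^T theta.

    phi      : (N, d) features of the sampled (s_i, a_i)
    phi_next : (N, d) features of the sampled (s'_i, a'_i)
    lam      : optional damping (hypothetical parameter; 0 means no damping)
    """
    N, d = phi.shape
    # TD errors; the bootstrapped target uses the current theta held fixed (assumed).
    delta = phi @ theta - (rewards + gamma * phi_next @ theta)
    g_hat = phi.T @ delta / N                  # empirical semi-gradient
    Sigma_hat = phi.T @ phi / N                # empirical curvature matrix
    return theta - beta * np.linalg.solve(Sigma_hat + lam * np.eye(d), g_hat)
```

With `lam = 0` the batch must be large enough for the empirical curvature matrix to be invertible, which is the role the sample-size conditions play in the analysis above.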
D.2 Proof of Theorem 3.12 (Stochastic Over-parameterized Linear GNTD)
By Lemma D.2, for any , when the batch size , we have . Let us write and then . Since is no longer positive definite in the over-parameterized setting, we need to handle the term via the Sherman-Morrison-Woodbury (SMW) formula, where is positive definite with high probability when . We also discuss the uniform upper bound of in Lemma D.7.
Lemma D.6.
For any , we set , and the sample size for -th iteration. Under Assumptions 3.1 and 3.4, we have w.p. that
[TABLE]
Proof.
Recall that for linear approximation, with . Then we can compute
[TABLE]
For any when the sample size , we have that . Let , and we know that is invertible. Then by the Sherman-Morrison-Woodbury (SMW) formula, we have
[TABLE]
Considering the singular value decomposition, we write . Then,
[TABLE]
where the last inequality is because . Thus by Lemma D.1, we have that
[TABLE]
and
[TABLE]
where the second inequality uses the fact that . Plugging (34), (35) into (D.2) yields that given ,
[TABLE]
where . This completes the proof. ∎
Lemma D.7.
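The role of the SMW formula here can be illustrated by the push-through identity it implies: a damped inverse of a rank-deficient d x d matrix applied to `Phi^T v` can be computed through the smaller N x N Gram system. The sketch below uses generic names (`Phi`, `v`, `lam`) that are assumptions for illustration rather than the notation of the proof.

```python
import numpy as np

def damped_direction_primal(Phi, v, lam):
    """(Phi^T Phi + lam I_d)^{-1} Phi^T v, solving a d x d system."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ v)

def damped_direction_dual(Phi, v, lam):
    """Phi^T (Phi Phi^T + lam I_N)^{-1} v, solving an N x N system."""
    N = Phi.shape[0]
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(N), v)
```

Both functions return the same vector for any `lam > 0`. In the over-parameterized regime (`d` much larger than `N`) the second form is cheaper and depends only on the Gram matrix `Phi @ Phi.T`, which stays positive definite when the rows of `Phi` are linearly independent.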
Suppose Assumptions 3.1 and 3.4 hold. We define
[TABLE]
Suppose the accuracy level is small enough s.t. If we set , , and the batch size satisfies for each iteration. Then for and , we have w.p. that
[TABLE]
Proof.
Similar to Lemma D.4, we prove this lemma by induction. For , we have
[TABLE]
Thus the lemma holds for . Suppose it holds for all , we prove this argument for .
For , we choose in Lemma D.6 and , , then we have
[TABLE]
Note that for , Then conditioning on the success of the high probability events in the first steps, by (D.2) in Lemma D.6, we have
[TABLE]
where (i) follows (37), and (ii) follows from
[TABLE]
Therefore, we also have
[TABLE]
where (i) follows a computation similar to (D.1). Therefore, we prove by induction that (36) holds for , w.p. . ∎
Now we are ready to prove Theorem 3.12. To include the discussion of the specific parameters in the theorem, we restate Theorem 3.12 as follows.
Theorem D.8.
Suppose Assumptions 3.1 and 3.4 hold and the accuracy level . If we set and the sample size for each iteration, where is the dimension of the parameter . Then w.p. the output of Algorithm 1 satisfies
[TABLE]
Then we can guarantee with iterations and samples in total.
Proof.
First we have that when the conditions in Lemma D.7 hold. By replacing in Lemma D.6 with , we have that
[TABLE]
Thus we complete the proof. ∎
D.3 Proof of Theorem 3.13 (Stochastic Neural GNTD)
Recall the stochastic GNTD formula with the neural network function (9) approximation:
[TABLE]
where the feature vector and the parameters . Note that Lemmas C.3, C.4, and C.5 are independent of the update rule, so they still hold in the current discussion. Next, we provide the uniform bound of the defined in (23).
Lemma D.9.
Conditioning on the success of the high probability events of Lemmas C.3 and C.5, where the success probabilities are chosen as , for any and any satisfying , we have w.p. that
[TABLE]
Proof.
First we compute the bounds on the gradient norm of the Q function as follows
[TABLE]
Consequently, decomposing the in into and yields
[TABLE]
where (i) follows and , and (ii) follows the distance between and . ∎
Lemma D.10.
Suppose Assumptions 3.1 and 3.6 hold. For any , we choose and the sample size for -th iteration. If the iteration , we have that
[TABLE]
and
[TABLE]
where .
Proof.
According to Lemma D.9, for any that satisfies , we have for any . Recalling in Lemma C.6, we consider
[TABLE]
The first term of the above equation is almost identical to the estimate of equation (D.2) in Lemma D.6. Recall that Lemma C.6 in Section C.3 provides a technique for analyzing the residual term. We follow this derivation and get
[TABLE]
where (i) follows from Lemma C.5 or the same derivation as in Lemma C.6, which reduces to the case of estimating (D.2). Finally, by Lemmas D.6 and D.9, we have that
[TABLE]
where represents the probability that the -weighted Gram matrix is not positive definite, and represents the probability that the concentration inequality fails. Plugging (39) and (40) into (38), we complete the proof. ∎
To simplify the notation, let us denote
[TABLE]
Now we provide the proof of Theorem 3.13, which is restated as follows, with more detailed discussion of the parameters.
Theorem D.11.
Suppose Assumptions 3.1 and 3.6 hold and the accuracy level s.t. . We set , , and the batch size s.t. , where is the dimension of the feature map , and is a small constant. Then w.p. , the output of Algorithm 1 satisfies
[TABLE]
Then we can guarantee with iterations and samples in total.
Proof.
Similar to Section C.3, the key results of Lemmas D.9 and D.10 all depend on the condition that stays close to . Thus we will need to prove the following result by induction
[TABLE]
Obviously (41) holds for .
We assume (41) holds for and prove this argument for . Note that Lemmas D.9 and D.10 hold for due to (41). Next we set and . With , we have that the following inequality holds for
[TABLE]
where (i) is due to the same derivation in Section C.3, (ii) is due to Lemma D.10, and (iii) is due to as long as the network width is sufficiently large. Consequently, we have for that
[TABLE]
where the last inequality is due to the selection of and in the theorem. Note that the accuracy level is small enough so that . Thus for , we have
[TABLE]
where (i) is due to Lemma D.10, and (ii) is due to for any . Therefore, the statement (41) holds.
The above verifies that when the conditions in Theorem D.11 hold, will always stay close to the initialization parameters with high probability. Thus the lemmas in Section D.3 all apply. Finally, by Lemma D.10, for each , we have that
[TABLE]
This completes the proof. ∎
D.4 Proof of Theorem 3.14 (Stochastic Nonlinear Smooth GNTD)
We restate Theorem 3.14 as follows to include the specifics of the parameters.
Theorem D.12.
Suppose Assumptions 3.1, 3.8 and 3.9 hold. We set and the damping rate for each iteration, where . Using the Proximal Boost (ProxBoost) method to obtain , the output of Algorithm 1 satisfies w.p. that
[TABLE]
for given constants . We choose the step size , the damping rate and the sample size for any . Then we can guarantee with iterations and samples in total.
Proof.
To begin with, we rewrite the problem (6) as and define its population version as Recalling the definition of in Section 3.1.3, we have . Let . For any , we have .
Now we consider the constrained subproblem . To solve this subproblem, we use the ProxBoost procedure, whose output is for each . Set and we have . For any satisfying , is - and -strongly convex w.r.t. , where . By Lemma D.2, when the number of samples per iteration in the ProxBoost method is at least , the empirical loss function is - and -strongly convex with probability at least . At this point, the subproblem satisfies the convergence conditions of the ProxBoost method. By Corollary 7 in (Davis et al., 2021), for any , we have that
[TABLE]
when the total number of samples used by the ProxBoost method in -th iteration is
[TABLE]
Thus,
[TABLE]
Therefore, there exists a constant such that
[TABLE]
Let for each iteration of GNTD. Then similar to Theorem 3.10, we have that
[TABLE]
where (i) is due to the key derivation in Section C.4, and (ii) is due to the gap between and . Choose in the above inequality and add them all. Then, we have that
[TABLE]
Thus we complete the proof. ∎
Appendix E Additional Experiments on Section 4.2.2
In this section, we provide additional implementation details and numerical results for GNTD3+BC. Following TD3+BC (Fujimoto & Gu, 2021), a behavioral cloning regularization term is added to constrain the expected total return
[TABLE]
For the critic, we apply the GNTD method to minimize the following MSBE of Clipped Double Q-learning (Fujimoto et al., 2018), that is,
[TABLE]
where
[TABLE]
See Algorithm 4 for more details.
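The two ingredients named above can be sketched in NumPy as follows. The sketch follows the published TD3+BC recipe (minimum of two target critics, actor objective with a Q-normalized weight and a squared behavioral cloning term); the function names and the array-based interface are assumptions for illustration, and the value of `alpha` is the default reported by Fujimoto & Gu (2021), not necessarily the one used here.

```python
import numpy as np

def clipped_double_q_target(rewards, dones, q1_next, q2_next, gamma):
    """Bootstrapped target using the smaller of the two target-critic values.

    q1_next, q2_next : (N,) target-critic values at (s'_i, a'_i), where a'_i is the
                       (smoothed) target-policy action, computed by the caller.
    dones            : (N,) 1.0 for terminal transitions, else 0.0
    """
    return rewards + gamma * (1.0 - dones) * np.minimum(q1_next, q2_next)

def td3_bc_actor_loss(q_pi, pi_actions, data_actions, alpha=2.5):
    """Actor loss: -lam * mean Q(s, pi(s)) + mean squared error to dataset actions.

    lam = alpha / mean|Q| normalizes the Q term; in an autodiff framework
    it would be treated as a constant (no gradient flows through it).
    """
    lam = alpha / np.abs(q_pi).mean()
    bc_term = np.mean((pi_actions - data_actions) ** 2)
    return -lam * q_pi.mean() + bc_term
```

In GNTD3+BC only the critic update changes: the squared error between the critic and this target is minimized with a Gauss-Newton step instead of a plain gradient step, while the actor loss is unchanged.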
Figure 3 shows the training curves for different algorithms. The GNTD3+BC algorithm (red) outperforms the TD3+BC algorithm (green) in both final results and convergence speed.
References
- Achiam, J., Knight, E., and Abbeel, P. Towards characterizing divergence in deep Q-learning. arXiv preprint arXiv:1903.08894, 2019.
- Agarwal, R., Schuurmans, D., and Norouzi, M. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR, 2020.
- Agazzi, A. and Lu, J. Temporal-difference learning with nonlinear function approximation: lazy training and mean field regimes. In Mathematical and Scientific Machine Learning, pp. 37–74. PMLR, 2022.
- Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
- Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pp. 1691–1692. PMLR, 2018.
- Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
- Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233–246, 2002.
- Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
