Gauss-Newton Temporal Difference Learning with Nonlinear Function Approximation

Zhifa Ke; Junyu Zhang; Zaiwen Wen

arXiv:2302.13087·math.OC·April 2, 2024

Abstract

In this paper, a Gauss-Newton Temporal Difference (GNTD) learning method is proposed to solve the Q-learning problem with nonlinear function approximation. In each iteration, our method takes one Gauss-Newton (GN) step to optimize a variant of Mean-Squared Bellman Error (MSBE), where target networks are adopted to avoid double sampling. Inexact GN steps are analyzed so that one can safely and efficiently compute the GN updates by cheap matrix iterations. Under mild conditions, non-asymptotic finite-sample convergence to the globally optimal Q function is derived for various nonlinear function approximations. In particular, for neural network parameterization with ReLU activation, GNTD achieves an improved sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-1})$ , as opposed to the $\mathcal{O}(\varepsilon^{-2})$ sample complexity of the existing neural TD methods. An…

Tables

Table 1: Total iterations and sample complexity for policy evaluation.

	Over-parameterized Linear	Neural	Smooth
Total iterations in population update
TD	$\mathcal{O}(\varepsilon^{-2})$	$\mathcal{O}(\varepsilon^{-2})$	–
GNTD	$\mathcal{O}(\log\varepsilon^{-1})$	$\mathcal{O}(\log\varepsilon^{-1})$	$\mathcal{O}(\varepsilon^{-1}\log\varepsilon^{-1})$
Sample complexity in stochastic update
TD	$\mathcal{O}(\varepsilon^{-4})$	$\mathcal{O}(\varepsilon^{-4})$	–
GNTD	$\mathcal{O}(\varepsilon^{-2}\log\varepsilon^{-1})$	$\mathcal{O}(\varepsilon^{-2}\log\varepsilon^{-1})$	$\mathcal{O}(\varepsilon^{-3}(\log\varepsilon^{-1})^{2})$

Table 2: Average normalized score over the final 10 evaluations and 5 seeds. Under the same configuration, we run four algorithms (BC, BCQ, TD3+BC, and GNTD3+BC), and also include the results previously reported for CQL and TD3+BC. $\pm$ denotes the standard deviation.

	BC	BCQ	CQL	TD3+BC	TD3+BC(Ours)	GNTD3+BC(Ours)
Halfcheetah-medium	42.4 $\pm$ 0.2	47.2 $\pm$ 0.4	37.2	42.8	46.5 $\pm$ 17.6	56.7 $\pm$ 0.3
Hopper-medium	30.1 $\pm$ 0.3	34.0 $\pm$ 3.8	44.2	99.5	100.2 $\pm$ 0.2	100.5 $\pm$ 0.3
Walker2d-medium	12.6 $\pm$ 3.1	53.3 $\pm$ 9.1	57.5	79.7	79.4 $\pm$ 1.6	81.1 $\pm$ 1.3
Halfcheetah-medium-replay	34.5 $\pm$ 0.8	33.0 $\pm$ 1.7	41.9	43.3	42.2 $\pm$ 0.8	42.6 $\pm$ 0.4
Hopper-medium-replay	20.0 $\pm$ 3.1	28.6 $\pm$ 1.1	28.6	31.4	31.8 $\pm$ 2.0	32.4 $\pm$ 1.3
Walker2d-medium-replay	8.1 $\pm$ 1.3	11.5 $\pm$ 1.3	15.8	25.2	23.9 $\pm$ 1.6	24.8 $\pm$ 2.1
Halfcheetah-medium-expert	70.6 $\pm$ 7.1	84.4 $\pm$ 4.5	27.1	97.9	90.9 $\pm$ 3.6	98.2 $\pm$ 3.3
Hopper-medium-expert	92.5 $\pm$ 15.1	111.4 $\pm$ 1.2	111.4	112.2	111.9 $\pm$ 0.4	112.0 $\pm$ 0.1
Walker2d-medium-expert	12.2 $\pm$ 3.3	50.7 $\pm$ 7.2	68.1	101.1	94.6 $\pm$ 15.5	104.5 $\pm$ 4.4
Halfcheetah-expert	104.9 $\pm$ 1.4	96.6 $\pm$ 2.8	82.4	105.7	103.9 $\pm$ 1.7	107.6 $\pm$ 0.6
Hopper-expert	111.3 $\pm$ 0.9	108.7 $\pm$ 5.1	111.2	112.2	112.3 $\pm$ 0.1	112.2 $\pm$ 0.5
Walker2d-expert	58.1 $\pm$ 9.1	92.6 $\pm$ 5.0	103.8	105.7	105.2 $\pm$ 1.8	107.6 $\pm$ 1.5
Total	597.3 $\pm$ 37.7	752.0 $\pm$ 43.2	728.9	956.7	942.8 $\pm$ 46.9	980.2 $\pm$ 16.1

Table 3: The Bellman error of four algorithms (TD, DQN, GNTD, and GNDQN) on various datasets.

Data set	TD	DQN	GNTD	GNDQN
CartPole-rep	757.93	0.6	3.03	0.51
CartPole-med-rep	97.78	0.66	1.51	0.58
Acrobot-rep	1.32	0.63	0.71	0.52
Acrobot-med-rep	1.41	0.68	0.68	0.58

Equations

$$\min_{\theta}\ \mathcal{L}_{\mu}(\theta)$$

$$\mathcal{L}_{\mu}(\theta;\theta_{targ})=\|Q(\theta)-T^{\pi}Q(\theta_{targ})\|_{\mu}^{2}.$$

$$\theta^{k+1}=\theta^{k}-\eta g^{k},$$

$$\theta^{k+1}\approx\operatorname*{argmin}_{\theta}\ \mathcal{L}_{\mu}(\theta;\theta^{k})$$

$$Q^{\pi}(s,a):=\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a_{t})\,\Big|\,s_{0}=s,a_{0}=a\Big],\quad\forall s,a.$$

$$T^{\pi}Q(s,a):=r(s,a)+\gamma P^{\pi}Q(s,a),\quad\forall s,a,$$

$$\min_{\theta}\ \|Q(\theta)-T^{\pi}Q(\theta^{k})\|_{\mu}^{2}.$$

$$\begin{cases}d^{k}=\operatorname*{argmin}_{d}\ \|Q(\theta^{k})+J_{Q}(\theta^{k})d-T^{\pi}Q(\theta^{k})\|_{\mu}^{2},\\ \theta^{k+1}=\theta^{k}+\beta\cdot d^{k}.\end{cases}$$

$$\|Q(\theta^{k}+\beta d^{k})-Q(\theta^{k})-\beta J_{Q}(\theta^{k})d^{k}\|\leq\mathcal{O}(\beta^{2}).$$

$$Q(\theta^{k+1})=\beta\cdot T^{\pi}Q(\theta^{k})+(1-\beta)\cdot Q(\theta^{k})+\mathcal{O}(\beta^{2}).$$

$$\ell(\xi,\theta^{k},d):=\big(\nabla Q(s,a;\theta^{k})^{\top}d+\delta^{k}(\xi)\big)^{2}$$

$$d^{k}=\operatorname*{argmin}_{d}\ L(\theta^{k},d):=\mathbb{E}_{\xi\sim\mathcal{D}}\big[\ell(\xi,\theta^{k},d)\big].$$

$$H(\theta^{k}):=\mathbb{E}_{(s,a)\sim\mu}\big[\nabla Q(s,a;\theta^{k})\nabla Q(s,a;\theta^{k})^{\top}\big],$$

$$g(\theta^{k}):=\mathbb{E}_{\xi\sim\mathcal{D}}\big[\delta^{k}(\xi)\cdot\nabla Q(s,a;\theta^{k})\big].$$

$$\hat{d}^{k}=\operatorname*{argmin}_{d}\ \frac{1}{N}\sum_{\xi\in\mathcal{D}_{k}}\ell(\xi,\theta^{k},d)+\frac{\omega}{2}\|d\|_{2}^{2},$$

$$\begin{cases}\hat{H}^{k}=\frac{1}{N}\sum_{\xi\in\mathcal{D}_{k}}\nabla Q(s,a;\theta^{k})\nabla Q(s,a;\theta^{k})^{\top},\\ \hat{g}^{k}=\frac{1}{N}\sum_{\xi\in\mathcal{D}_{k}}\delta^{k}(\xi)\cdot\nabla Q(s,a;\theta^{k}),\\ \theta^{k+1}=\theta^{k}-\beta\cdot(\hat{H}^{k}+\omega I)^{-1}\hat{g}^{k},\end{cases}$$

$$\|T^{\pi}Q_{1}-T^{\pi}Q_{2}\|_{\mu}\leq\gamma\|Q_{1}-Q_{2}\|_{\mu},\quad\forall Q_{1},Q_{2}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}.$$

$$Q(s,a;\theta)=\phi(s,a)^{\top}\theta,\quad\forall s,a.$$

$$\Phi\theta^{*}=\Pi_{\mu}T\Phi\theta^{*},$$

$$\|Q(\theta^{K})-Q^{\pi}\|_{\mu}\leq$$

$$+\frac{1}{1-\gamma}\|\Pi_{\mu}Q^{\pi}-Q^{\pi}\|_{\mu}.$$

$$Q^{\pi}=Q^{*}\in\text{Span}\{\Phi\},$$

$$Q^{\pi}(s,a)=r(s,a)+\gamma\sum_{(s^{\prime},a^{\prime})\in\Omega}\mathbb{P}(s^{\prime}|s,a)\cdot\pi(a^{\prime}|s^{\prime})\cdot Q^{\pi}(s^{\prime},a^{\prime}).$$

$$\min_{\theta}\ \mathbb{E}\big[\|Q(\theta)-TQ(\theta)\|_{\mu}^{2}\big]=\mathbb{E}\big[\|Q(\theta)-TQ(\theta)\|_{\Omega,\mu}^{2}\big],$$

$$\|Q(\theta^{K})-Q^{\pi}\|_{\mu}\leq(1-(1-\gamma)\beta)^{K}\|Q(\theta^{0})-Q^{\pi}\|_{\mu}.$$

$$Q(s,a;\theta):=\frac{1}{m}\sum_{r=1}^{m}b_{r}\psi(\theta_{r}^{\top}\phi(s,a)),$$

$$\|Q(\theta^{K})-Q^{\pi}\|_{\mu}\leq\Big(1-\frac{(1-\gamma)\beta}{2}\Big)^{K}\cdot\|Q(\theta^{0})-Q^{\pi}\|_{\mu}.$$

$$\|Q(s,a;\theta_{1})-Q(s,a;\theta_{2})\|\leq L_{1}\|\theta_{1}-\theta_{2}\|,\quad\forall\theta_{1},\theta_{2},$$

$$\|\nabla Q(s,a;\theta_{1})-\nabla Q(s,a;\theta_{2})\|\leq L_{2}\|\theta_{1}-\theta_{2}\|,\quad\forall\theta_{1},\theta_{2}.$$

$$\varepsilon_{\mathcal{F}}^{2}=\max_{\theta\in\mathcal{F}}\min_{d}L(\theta,d)<\infty,$$

$$\|Q(\theta^{K})-Q^{\pi}\|_{\mu}\leq$$

Full text

Provably Efficient Gauss-Newton Temporal Difference Learning Method with Function Approximation

Zhifa Ke

Zaiwen Wen

Junyu Zhang

Abstract

In this paper, based on the spirit of Fitted Q-Iteration (FQI), we propose a Gauss-Newton Temporal Difference (GNTD) method to solve the Q-value estimation problem with function approximation. In each iteration, unlike the original FQI that solves a nonlinear least-squares subproblem to fit the Q-iteration, the GNTD method can be viewed as an inexact FQI that takes only one Gauss-Newton step to optimize this subproblem, which is much cheaper in computation. Compared to the popular Temporal Difference (TD) learning, which can be viewed as taking a single gradient descent step on FQI's subproblem per iteration, the Gauss-Newton step of GNTD better retains the structure of FQI and hence leads to better convergence. In our work, we derive the finite-sample non-asymptotic convergence of GNTD under linear, neural network, and general smooth function approximations. In particular, recent works on neural TD only guarantee a suboptimal $\mathcal{O}(\epsilon^{-4})$ sample complexity, while GNTD obtains an improved complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ . Finally, we validate our method via extensive experiments in both online and offline RL problems. Our method exhibits both higher rewards and faster convergence than TD-type methods, including DQN.

Machine Learning, ICML

1 Introduction

In this paper, we consider the policy evaluation problem, namely, the problem of evaluating the state-action value function (Q function). This is a fundamental building block of many popular Reinforcement Learning (RL) algorithms, including the policy improvement method (Sutton et al., 1999), trust region policy optimization (Schulman et al., 2015), and actor-critic algorithms (Konda & Tsitsiklis, 1999; Lillicrap et al., 2015; Fujimoto et al., 2018). A properly evaluated Q function often greatly boosts the performance of these methods. For modern RL with enormous state and action spaces, appropriately parameterizing the Q function with a function approximation $Q(\theta)$ is crucial to the scalability of RL algorithms. Common examples include linear (Bhandari et al., 2018; Zou et al., 2019; Srikant & Ying, 2019) and neural (Cai et al., 2019; Xu & Gu, 2020; Agazzi & Lu, 2022) function approximations.

Let $Q^{\pi}$ be the unknown state action value function to be estimated under policy $\pi$ and let $T^{\pi}$ be the corresponding Bellman operator. Then the standard formulation for policy evaluation with function approximation is to minimize the Mean-Squared Bellman Error (MSBE):

$$\min_{\theta}\ \mathcal{L}_{\mu}(\theta)\qquad(1)$$

where $\mu$ is the stationary distribution of the state-action pairs under the policy $\pi$ . A direct optimization of MSBE can be very hard, as the double sampling issue prevents an unbiased estimator of $\nabla\mathcal{L}_{\mu}(\theta)$ . A practical technique to address the double sampling issue is to modify the loss function by introducing a target parameter $\theta_{targ}$ as follows

$$\mathcal{L}_{\mu}(\theta;\theta_{targ})=\|Q(\theta)-T^{\pi}Q(\theta_{targ})\|_{\mu}^{2}.$$

One popular method for solving (1) is the Temporal Difference (TD) (Sutton, 1988) learning algorithm and its variants (Bradtke & Barto, 1996; Sutton et al., 2009a, b; Tu & Recht, 2018). Given the surrogate loss $\mathcal{L}_{\mu}(\theta;\theta_{targ})$ and the target parameter $\theta_{targ}^{k}=\theta^{k}$ , the vanilla TD step updates by

$$\theta^{k+1}=\theta^{k}-\eta g^{k},$$

where $g^{k}$ is an unbiased estimator of $\nabla_{\theta}\mathcal{L}_{\mu}(\theta^{k};\theta_{targ}^{k})$ . Since $\mathbb{E}[g^{k}]$ only contains a part of $\nabla_{\theta}\mathcal{L}_{\mu}(\theta^{k})$ , TD is also termed the semi-gradient method. Though TD-type algorithms have been empirically successful in many numerical applications (Mnih et al., 2013; Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018), the semi-gradient nature of the TD-type methods has mostly limited their theoretical convergence guarantees to linear function approximations (Sutton et al., 2009a, b). Recent results on neural TD (Cai et al., 2019; Xu & Gu, 2020) typically require a suboptimal $\mathcal{O}(\epsilon^{-4})$ sample complexity to find some $\bar{\theta}$ s.t. $\|Q(\bar{\theta})-Q(\theta^{*})\|_{\mu}\leq\epsilon$ , where $\theta^{*}$ is the optimal solution. For general smooth function approximation, except for a few variants (Maei et al., 2009) where only asymptotic convergence is obtained, TD-type methods have divergence issues both in theory and practice (Tsitsiklis & Van Roy, 1996; Maei et al., 2009; Achiam et al., 2019; Brandfonbrener & Bruna, 2019).
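To make the semi-gradient idea concrete, the following is a minimal sketch of one batch TD step for a linear model $Q(s,a;\theta)=\phi(s,a)^{\top}\theta$ . The function name `td_semi_gradient_step` and the array layout are illustrative assumptions, not part of the paper, and the constant factor 2 from differentiating the squared loss is assumed to be absorbed into the step size `eta`.

```python
import numpy as np

def td_semi_gradient_step(theta, phi, r, phi_next, gamma, eta):
    """One batch TD step for a linear model Q(s, a; theta) = phi(s, a)^T theta.

    phi, phi_next: (N, d) feature arrays for (s, a) and (s', a').
    r: (N,) array of rewards.
    """
    # TD error: Q(s, a) - (r + gamma * Q(s', a')), with the target parameter
    # frozen at the current theta.
    delta = phi @ theta - (r + gamma * (phi_next @ theta))
    # Semi-gradient: differentiate only through Q(s, a; theta), not the target.
    g = (delta[:, None] * phi).mean(axis=0)
    return theta - eta * g
```

The target term `gamma * (phi_next @ theta)` is treated as a constant when forming `g`, which is exactly why the expected update contains only part of the full MSBE gradient.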

Another popular approach for policy evaluation is the Fitted Q-Iteration (FQI) method (Riedmiller, 2005; Chen & Jiang, 2019; Fan et al., 2020), which repeatedly solves a nonlinear least square subproblem to fit the Q-iteration:

$$\theta^{k+1}\approx\operatorname*{argmin}_{\theta}\ \mathcal{L}_{\mu}(\theta;\theta^{k})$$

by sufficiently many stochastic gradient steps. With a sufficiently expressive function approximation class such as over-parameterized neural networks, FQI can closely approximate the contractive Q-iteration $Q(\theta^{k+1})\approx T^{\pi}Q(\theta^{k})$ and hence retain desirable convergence properties. However, the need to solve the subproblem leads to an expensive per-iteration computation, which makes FQI less competitive against TD in practice.

Interestingly, each iteration of TD method can be viewed as an inexact FQI that takes only one stochastic gradient descent step to optimize the nonlinear least square subproblem. The inexactness here can be an interpretation of the divergence behavior of TD learning for general function approximations. This motivates us to design the Gauss-Newton Temporal Difference (GNTD) learning algorithm, which takes one Gauss-Newton step to optimize FQI’s subproblem. The proposed method lies exactly in between FQI and TD, which has cheaper per-iteration computation compared to FQI, and better theoretical convergence compared to TD. In this paper, we provide a complete finite-time convergence analysis of GNTD under linear functions, neural network functions and general smooth functions, for both population and random sampling cases. The detailed sample complexities of GNTD are summarized in Table 1.

1.1 Contributions

Our contributions are summarized as follows.

• We propose the Gauss-Newton Temporal Difference (GNTD) learning algorithm as an inexact FQI method with Gauss-Newton steps. We also design a practically efficient implementation of GNTD based on damping and the K-FAC method.

• We derive convergence and sample complexity analysis of the GNTD method with both population and stochastic updates, as summarized in Table 1. Compared to the existing results of the TD method, GNTD achieves better sample complexities under linear, neural network and general smooth function approximations. In particular, for over-parameterized neural network approximation, GNTD achieves an improved $\mathcal{O}(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon})$ sample complexity as opposed to the $\mathcal{O}(\frac{1}{\varepsilon^{4}})$ complexity of neural TD.

We also conduct extensive experiments to numerically validate the efficiency of GNTD. For both continuous and discrete tasks, our method outperforms TD and its variants, including DQN. Besides the online setting that is theoretically analyzed in this paper, interestingly, GNTD also exhibits advantageous performance in offline RL tasks compared to several state-of-the-art benchmarks.

1.2 Related Work

TD learning was first proposed for policy evaluation and Q-learning (Sutton, 1988), and numerous variants were later developed, including Gradient TD (Sutton et al., 2009a, b), Least-squares TD (Bradtke & Barto, 1996; Boyan, 2002; Ghavamzadeh et al., 2010; Tu & Recht, 2018), and DQN (Mnih et al., 2013). However, the semi-gradient nature makes the convergence of TD-type methods highly non-trivial. We list the convergence and complexity results of TD-type methods that are most relevant to our work.

Asymptotic Analysis. There are extensive results on the asymptotic convergence analysis of linear TD (Jaakkola et al., 1993; Tsitsiklis & Van Roy, 1996; Perkins & Pendrith, 2002; Borkar, 2009). However, analyzing nonlinear TD is challenging. In fact, nonlinear TD often diverges in practice (Tsitsiklis & Van Roy, 1996; Maei et al., 2009; Achiam et al., 2019; Brandfonbrener & Bruna, 2019).

Finite-time Analysis. The non-asymptotic sample complexities of linear TD were recently analyzed in (Bhandari et al., 2018; Dalal et al., 2018a; Zou et al., 2019). Regarding its variants (Gradient TD and Least-squares TD), finite-time analyses are also established in (Dalal et al., 2018b; Touati et al., 2018; Liu et al., 2020) and (Lazaric et al., 2010; Prashanth et al., 2014; Tagorti & Scherrer, 2015), respectively. However, such a reformulation leads to bi-level optimization, which is difficult to extend to nonlinear Q-learning and lacks stability in practice (Pfau & Vinyals, 2016). For nonlinear neural TD methods (Cai et al., 2019; Brandfonbrener & Bruna, 2019; Xu & Gu, 2020), the key observation is that wide over-parameterized neural networks are approximately linear under the Neural Tangent Kernel (NTK) regime (Du et al., 2018; Zhang et al., 2019; Allen-Zhu et al., 2019). However, existing neural TD methods only achieve the suboptimal $\mathcal{O}(\varepsilon^{-4})$ sample complexity.

Besides TD learning, a recent important development for policy evaluation with function approximation is the FQI method (Riedmiller, 2005; Chen & Jiang, 2019; Fan et al., 2020). As long as the function class is expressive enough, and the nonlinear least square fits the Q-iteration well, FQI will always converge to desirable solutions.

2 Gauss-Newton Temporal Difference Learning

2.1 Preliminaries

We consider the infinite-horizon discounted Markov decision process (MDP) $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P},r,\gamma)$ , with state space $\mathcal{S}$ , action space $\mathcal{A}$ , reward function $r:\mathcal{S}\times\mathcal{A}\rightarrow[-R_{\max},R_{\max}]$ , transition probability $\mathbb{P}$ , and a discount factor $\gamma\in(0,1)$ . Let policy $\pi$ be a mapping that returns a probability distribution $\pi(\cdot|s)$ over the action space $\mathcal{A}$ , for any state $s\in\mathcal{S}$ . Then the state-action value function (Q-function) under policy $\pi$ is

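$$Q^{\pi}(s,a):=\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\cdot r(s_{t},a_{t})\,\Big|\,s_{0}=s,a_{0}=a\Big],\quad\forall s,a.$$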

For any mapping $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ , we denote $T^{\pi}$ as the Bellman operator:

$$T^{\pi}Q(s,a):=r(s,a)+\gamma P^{\pi}Q(s,a),\quad\forall s,a,$$

where $P^{\pi}Q(s,a)\!=\!\mathbb{E}\left[Q(s^{\prime}\!,a^{\prime})|s^{\prime}\!\!\sim\!\mathbb{P}(\cdot|s,a),a^{\prime}\!\!\sim\!\pi\left(\cdot|s^{\prime}\right)\right].$
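For a small tabular MDP, the Bellman operator and its fixed point $Q^{\pi}$ can be written down directly. The sketch below is only an illustration under assumed conventions: the helper names (`build_p_pi`, `bellman_operator`, `q_pi`) are hypothetical, transitions are stored as an `(S, A, S)` array, and state-action pairs are flattened in lexicographical order.

```python
import numpy as np

def build_p_pi(P, pi):
    """State-action transition matrix under policy pi.

    P: (S, A, S) array with P[s, a, s'] = P(s' | s, a).
    pi: (S, A) array with pi[s, a] = pi(a | s).
    Returns an (S*A, S*A) row-stochastic matrix whose entry
    [(s, a), (s', a')] equals P(s' | s, a) * pi(a' | s').
    """
    S, A, _ = P.shape
    return np.einsum("sat,tb->satb", P, pi).reshape(S * A, S * A)

def bellman_operator(Q, r, P_pi, gamma):
    """Apply T^pi to a flattened Q vector: r + gamma * P^pi Q."""
    return r + gamma * (P_pi @ Q)

def q_pi(r, P_pi, gamma):
    """Solve the linear system (I - gamma * P^pi) Q = r for the fixed point Q^pi."""
    n = r.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)
```

Since $\gamma<1$ and `P_pi` is row-stochastic, the matrix `I - gamma * P_pi` is invertible, so `q_pi` returns the unique vector satisfying `bellman_operator(Q, r, P_pi, gamma) == Q` up to floating-point error.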

2.2 The GNTD Method

Recall the nonlinear least square subproblem of FQI:

$$\min_{\theta}\ \|Q(\theta)-T^{\pi}Q(\theta^{k})\|_{\mu}^{2}.$$

Unlike FQI that solves this problem with sufficiently many SGD steps, and unlike TD that optimizes the loss with merely one SGD step, we would like to propose a method lying between FQI and TD methods. Denote $J_{Q}(\theta)$ as the Jacobian matrix of the parameterized Q-function $Q(\theta)$ . Then the GNTD method linearizes the $Q(\theta)$ in the FQI subproblem and updates the iterates by

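$$\begin{cases}d^{k}=\operatorname*{argmin}_{d}\ \|Q(\theta^{k})+J_{Q}(\theta^{k})d-T^{\pi}Q(\theta^{k})\|_{\mu}^{2},\\ \theta^{k+1}=\theta^{k}+\beta\cdot d^{k}.\end{cases}\qquad(3)$$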

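The intuition behind the GNTD update is as follows. Suppose $\text{Span}\{J_{Q}(\theta^{k})\}$ , the space spanned by the columns of the Jacobian, is expressive enough that $J_{Q}(\theta^{k})d^{k}=T^{\pi}Q(\theta^{k})-Q(\theta^{k}).$ Under proper conditions, informally, one also has

$$\|Q(\theta^{k}+\beta d^{k})-Q(\theta^{k})-\beta J_{Q}(\theta^{k})d^{k}\|\leq\mathcal{O}(\beta^{2}).$$

Combining the above two relations yields

$$Q(\theta^{k+1})=\beta\cdot T^{\pi}Q(\theta^{k})+(1-\beta)\cdot Q(\theta^{k})+\mathcal{O}(\beta^{2}).$$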

Then the contraction of Bellman operator further yields $\|Q(\theta^{K})-Q^{\pi}\|_{\infty}\leq(1-(1-\gamma)\beta)^{K}\|Q(\theta^{0})-Q^{\pi}\|_{\infty}+\mathcal{O}\big{(}\frac{\beta}{1-\gamma}\big{)}.$ Therefore, setting $\beta=\mathcal{O}(\epsilon)$ or adopting a diminishing sequence of $\beta^{k}\to 0$ will provide the convergence of GNTD in this ideal situation.
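The step from the one-iteration relation to the $K$-iteration bound can be spelled out as follows. This is only a sketch of the informal argument: the constant $C$ is an assumed uniform bound on the $\mathcal{O}(\beta^{2})$ term, $\beta\in(0,1]$ is assumed, and $e_{k}$ is shorthand introduced here for the error at iteration $k$.

```latex
\begin{aligned}
e_{k} &:= \|Q(\theta^{k})-Q^{\pi}\|_{\infty}, \qquad Q^{\pi}=T^{\pi}Q^{\pi},\\
e_{k+1} &\le (1-\beta)\,e_{k}+\beta\,\|T^{\pi}Q(\theta^{k})-T^{\pi}Q^{\pi}\|_{\infty}+C\beta^{2}
 \;\le\; \big(1-(1-\gamma)\beta\big)\,e_{k}+C\beta^{2},\\
e_{K} &\le \big(1-(1-\gamma)\beta\big)^{K}e_{0}+C\beta^{2}\sum_{j=0}^{K-1}\big(1-(1-\gamma)\beta\big)^{j}
 \;\le\; \big(1-(1-\gamma)\beta\big)^{K}e_{0}+\frac{C\beta}{1-\gamma}.
\end{aligned}
```

The last inequality uses the geometric series bound $\sum_{j\ge 0}(1-(1-\gamma)\beta)^{j}=\frac{1}{(1-\gamma)\beta}$ , which is where the residual $\mathcal{O}\big(\frac{\beta}{1-\gamma}\big)$ term comes from.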

For each tuple $\xi\!=\!(s,a,r,s^{\prime},a^{\prime})\sim\mathcal{D}$ , where $\mathcal{D}$ is the distribution under which $(s,a)\!\sim\!\mu$ , $r=r(s,a)$ , $s^{\prime}\!\sim\!\mathbb{P}(\cdot|s,a)$ , $a^{\prime}\!\sim\!\pi(\cdot|s^{\prime})$ , we define the loss function

$$\ell(\xi,\theta^{k},d):=\big(\nabla Q(s,a;\theta^{k})^{\top}d+\delta^{k}(\xi)\big)^{2}$$

where $\delta^{k}(\xi)=Q(s,a;\theta^{k})-\left(r(s,a)+\gamma Q\left(s^{\prime},a^{\prime};\theta^{k}\right)\right)$ is the TD error w.r.t. the tuple $\xi$ . By removing constant terms, the GNTD subproblem for $d^{k}$ in (3) can be rewritten in a more sample-friendly form:

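$$d^{k}=\operatorname*{argmin}_{d}\ L(\theta^{k},d):=\mathbb{E}_{\xi\sim\mathcal{D}}\big[\ell(\xi,\theta^{k},d)\big].\qquad(4)$$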

Define the curvature matrix $H$ and the semi-gradient $g$ as

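$$H(\theta^{k}):=\mathbb{E}_{(s,a)\sim\mu}\big[\nabla Q(s,a;\theta^{k})\nabla Q(s,a;\theta^{k})^{\top}\big],\qquad g(\theta^{k}):=\mathbb{E}_{\xi\sim\mathcal{D}}\big[\delta^{k}(\xi)\cdot\nabla Q(s,a;\theta^{k})\big].$$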

Then (4) has a closed form solution $d^{k}\!=\!-[H(\theta^{k})]^{-1}g(\theta^{k})$ . Note that the solution $d^{k}$ is the Gauss-Newton direction for solving the nonlinear system $Q(\theta)=T^{\pi}Q(\theta^{k})$ (Wright et al., 1999), and the semi-gradient $g(\theta^{k})$ is exactly the expected TD direction. Hence we call our method the Gauss-Newton Temporal Difference (GNTD) method.

Note that the population update (3) is not practically implementable. We introduce an empirical version of (4), with an additional quadratic damping term to improve robustness:

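$$\hat{d}^{k}=\operatorname*{argmin}_{d}\ \frac{1}{N}\sum_{\xi\in\mathcal{D}_{k}}\ell(\xi,\theta^{k},d)+\frac{\omega}{2}\|d\|_{2}^{2},\qquad(6)$$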

where $\mathcal{D}_{k}=\{(s_{i},a_{i},r_{i},s^{\prime}_{i},a^{\prime}_{i})\}_{i=1}^{N}$ is a batch of $N$ samples from $\mathcal{D}$ . Then (6) leads to the empirical GNTD update:

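$$\begin{cases}\hat{H}^{k}=\frac{1}{N}\sum_{\xi\in\mathcal{D}_{k}}\nabla Q(s,a;\theta^{k})\nabla Q(s,a;\theta^{k})^{\top},\\ \hat{g}^{k}=\frac{1}{N}\sum_{\xi\in\mathcal{D}_{k}}\delta^{k}(\xi)\cdot\nabla Q(s,a;\theta^{k}),\\ \theta^{k+1}=\theta^{k}-\beta\cdot(\hat{H}^{k}+\omega I)^{-1}\hat{g}^{k},\end{cases}\qquad(7)$$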

as detailed in Algorithm 1. For large-scale function approximation class such as over-parameterized neural networks, the matrix inversion in (7) can be expensive. In this case, we provide a practically efficient implementation of GNTD using the Kronecker-factored Approximate Curvature (K-FAC) method. Please see the details in the Appendix.
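As a rough illustration of the damped empirical update, the sketch below computes $\hat{H}^{k}$ , $\hat{g}^{k}$ and the new iterate from per-sample gradients and TD errors. It is not the paper's Algorithm 1 or its K-FAC implementation: the function name `gntd_step` and its inputs are assumptions, and the per-sample gradients are assumed to be supplied by the caller (for example from automatic differentiation).

```python
import numpy as np

def gntd_step(theta, grad_q, delta, beta, omega):
    """One damped empirical GNTD update.

    grad_q: (N, p) array; row i is the gradient of Q(s_i, a_i; theta) w.r.t. theta.
    delta:  (N,) array of TD errors delta^k(xi_i).
    beta:   step size; omega: damping coefficient.
    """
    n, p = grad_q.shape
    H_hat = grad_q.T @ grad_q / n   # empirical curvature matrix
    g_hat = grad_q.T @ delta / n    # empirical semi-gradient (TD direction)
    # Solve (H_hat + omega * I) d = -g_hat instead of forming the inverse.
    d_hat = np.linalg.solve(H_hat + omega * np.eye(p), -g_hat)
    return theta + beta * d_hat
```

Using a linear solve keeps the cost at one factorization of a $p\times p$ matrix per step. For large $p$ this is still expensive, which is the situation the Kronecker-factored approximation mentioned above is meant to address.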

3 Convergence Analysis

In this section, we provide a finite-time convergence analysis of the GNTD method under linear, neural network, and general smooth function approximations for both population and stochastic updates. Detailed proofs can be found in Appendices C and D.

Assumption 3.1.

The data distribution $\mu$ is the stationary distribution over the state-action pairs under the policy $\pi$ .

We write $U\!:=\!\text{diag}(\mu)$ as an $|\mathcal{S}||\mathcal{A}|$ -dimensional diagonal matrix, whose $(s,a)$ -th diagonal element is $\mu(s,a)$ . Denote $\langle f,g\rangle_{\mu}\!:=\!\sum_{s,a}\mu(s,a)f(s,a)g(s,a)$ , and $\left\|f\right\|_{\mu}^{2}=\langle f,f\rangle_{\mu}$ . Then Assumption 3.1 indicates that $T^{\pi}$ is a contraction w.r.t. $\|\cdot\|_{\mu}$ (Tsitsiklis & Van Roy, 1996), that is,

$$\|T^{\pi}Q_{1}-T^{\pi}Q_{2}\|_{\mu}\leq\gamma\|Q_{1}-Q_{2}\|_{\mu},\quad\forall Q_{1},Q_{2}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}.$$

For ease of notation, we will denote $T^{\pi}$ as $T$ in later discussion. Instead of an $|\mathcal{S}|\times|\mathcal{A}|$ matrix, we view $Q(\theta)$ as an $|\mathcal{S}||\mathcal{A}|\times 1$ column vector, with $(s,a)$ being a multi-index arranged in the lexicographical order.
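The contraction property in the $\mu$-weighted norm can be checked numerically on a small chain. The sketch below assumes a dense row-stochastic state-action matrix `P_pi` (hence irreducible and aperiodic); the helper names are hypothetical and the eigenvector-based computation of $\mu$ is just one convenient choice.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Stationary distribution mu of a row-stochastic matrix, i.e. mu^T P_pi = mu^T."""
    vals, vecs = np.linalg.eig(P_pi.T)
    # Pick the eigenvector whose eigenvalue is closest to 1 and normalize it.
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def mu_norm(f, mu):
    """Weighted norm ||f||_mu = sqrt(sum_i mu_i * f_i^2)."""
    return np.sqrt(np.sum(mu * f ** 2))

def contraction_ratio(Q1, Q2, r, P_pi, mu, gamma):
    """Ratio ||T Q1 - T Q2||_mu / ||Q1 - Q2||_mu for T Q = r + gamma * P_pi Q."""
    TQ1 = r + gamma * (P_pi @ Q1)
    TQ2 = r + gamma * (P_pi @ Q2)
    return mu_norm(TQ1 - TQ2, mu) / mu_norm(Q1 - Q2, mu)
```

For any two vectors `Q1 != Q2`, the returned ratio should not exceed `gamma` when `mu` is the stationary distribution of `P_pi`; with a non-stationary weighting the bound can fail.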

3.1 Iteration Complexity for Population Update

3.1.1 Linear Approximation

Consider the linear function approximation with $\theta\in\mathbb{R}^{d}$ :

$$Q(s,a;\theta)=\phi(s,a)^{\top}\theta,\quad\forall s,a.$$

Let $\Phi\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|\times d}$ be the collection of all feature vectors, whose $(s,a)$ -th row equals $\phi(s,a)^{\top}$ . Without loss of generality, we assume $\|\phi(s,a)\|\leq 1$ for any $(s,a)$ . Then the minimizer $\theta^{*}$ of the MSBE (1) satisfies the projected Bellman equation

$$\Phi\theta^{*}=\Pi_{\mu}T\Phi\theta^{*},$$

where $\Pi_{\mu}$ is the projection to the subspace $\{\Phi x:x\in\mathbb{R}^{d}\}$ under the inner product $\langle\cdot,\cdot\rangle_{\mu}$ . We denote $Q^{*}:=\Phi\theta^{*}$ as the optimal linear approximator.

In the under-parameterized regime where $d<|\mathcal{S}||\mathcal{A}|$ , we make the following assumption for the feature covariance matrix, in accordance with (Bhandari et al., 2018).

Assumption 3.2.

Define the feature covariance matrix as $\Sigma:=\mathbb{E}_{(s,a)\sim\mu}\left[\phi(s,a)\phi(s,a)^{\top}\right]$ . We assume $\Sigma\succ 0$ and denote $\lambda_{0}>0$ as its minimum eigenvalue.

Then the next theorem provides the convergence rate of linear GNTD in the under-parameterized regime, which matches the standard rate of linear TD under the same assumption (Bhandari et al., 2018).

Theorem 3.3.

Suppose Assumptions 3.1 and 3.2 hold. If we set $\beta=\frac{(1-\gamma)\lambda_{0}}{4}$ in the population GNTD update (3), then

[TABLE]

If the true Q-function is realizable, namely,

$$Q^{\pi}=Q^{*}\in\text{Span}\{\Phi\},\qquad(8)$$

then the intrinsic error term $\|\Pi_{\mu}Q^{\pi}-Q^{\pi}\|_{\mu}$ in Theorem 3.3 vanishes and the estimated Q-function $Q(\theta^{K})$ converges linearly to the true Q-function $Q^{\pi}$ .
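For linear features the population GNTD direction has the explicit form $d^{k}=\Sigma^{-1}\Phi^{\top}U\big(T\Phi\theta^{k}-\Phi\theta^{k}\big)$ , so the iteration can be simulated on a small problem and compared against the solution of the projected Bellman equation. The sketch below is an illustration under stated assumptions: the function names are hypothetical, `mu` is assumed to be the stationary distribution of `P_pi`, and the step size is left as a free argument rather than the specific value used in Theorem 3.3.

```python
import numpy as np

def linear_gntd_population(Phi, mu, r, P_pi, gamma, beta, num_iters):
    """Population GNTD for Q(theta) = Phi @ theta, starting from theta = 0."""
    U = np.diag(mu)
    Sigma = Phi.T @ U @ Phi                # feature covariance matrix
    theta = np.zeros(Phi.shape[1])
    for _ in range(num_iters):
        q = Phi @ theta
        target = r + gamma * (P_pi @ q)    # T Q(theta^k)
        # Gauss-Newton direction: weighted least-squares fit of T Q - Q by Phi d.
        direction = np.linalg.solve(Sigma, Phi.T @ U @ (target - q))
        theta = theta + beta * direction
    return theta

def projected_fixed_point(Phi, mu, r, P_pi, gamma):
    """Solve Phi theta = Pi_mu T Phi theta, i.e. Phi^T U (Phi - gamma P Phi) theta = Phi^T U r."""
    U = np.diag(mu)
    A = Phi.T @ U @ (Phi - gamma * (P_pi @ Phi))
    b = Phi.T @ U @ r
    return np.linalg.solve(A, b)
```

In this linear setting each step maps $\Phi\theta^{k}$ to $(1-\beta)\Phi\theta^{k}+\beta\Pi_{\mu}T\Phi\theta^{k}$ , so for $\beta\in(0,1]$ the iterates should approach the output of `projected_fixed_point` geometrically.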

Next, as a warm-up for over-parameterized neural network function approximation, we analyze linear GNTD in the over-parameterized regime where $d\geq|\mathcal{S}||\mathcal{A}|$ . In this case, the Q-function $Q^{\pi}$ is always realizable and (8) holds true. However, Assumption 3.2 no longer holds, since the feature covariance matrix has rank at most $|\mathcal{S}||\mathcal{A}|$ while the matrix dimension $d$ can be much larger. As a result, the standard convergence rate of population linear TD degrades to sublinear convergence (Bhandari et al., 2018). In contrast, the population version of GNTD still retains a linear convergence rate under very mild conditions.

Recall that $U=\text{diag}(\mu)$ is a diagonal matrix. Since $\Sigma=\Phi^{\top}U\Phi$ is no longer positive definite, further discussion is needed. Let $\Phi_{\mu}=U^{1/2}\Phi$ , and we introduce the following assumption.

Assumption 3.4.

The $\mu$ -weighted Gram matrix $\Phi_{\mu}\Phi_{\mu}^{\top}\succ 0$ and we denote $\lambda_{0}>0$ as its minimum eigenvalue.

Let $\Omega:=\text{supp}(\mu)$ be the support of the distribution $\mu$ . Then Assumption 3.4 is equivalent to requiring that the feature vectors are linearly independent and the support $\Omega$ covers the full $\mathcal{S}\times\mathcal{A}$ . In general, $\Omega=\mathcal{S}\times\mathcal{A}$ is a very strong assumption. However, it is not necessary for our analysis. For a policy $\pi$ , under the mild condition that the state transition Markov chain is aperiodic and irreducible, $\mu(s,a)=0$ iff $\pi(a|s)=0$ . Then the Bellman equation is closed on $\Omega$ since it does not involve $Q(s,a)$ for $(s,a)\notin\Omega$ :

$$Q^{\pi}(s,a)=r(s,a)+\gamma\sum_{(s^{\prime},a^{\prime})\in\Omega}\mathbb{P}(s^{\prime}|s,a)\cdot\pi(a^{\prime}|s^{\prime})\cdot Q^{\pi}(s^{\prime},a^{\prime}).$$

This allows us to only care about the MSBE on $\Omega$ :

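$$\min_{\theta}\ \mathbb{E}\big[\|Q(\theta)-TQ(\theta)\|_{\mu}^{2}\big]=\mathbb{E}\big[\|Q(\theta)-TQ(\theta)\|_{\Omega,\mu}^{2}\big],$$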

where $||f||_{\Omega,\mu}^{2}\!:=\!\sum_{(s,a)\in\Omega}\mu(s,a)|f(s,a)|^{2}$ . Therefore, Assumption 3.4 can actually be relaxed to requiring the $\mu$ -weighted Gram matrix on $\Omega$ to be positive definite. Though we will have no guarantee for $|Q(s,a;\theta)-Q^{\pi}(s,a)|$ for $(s,a)\notin\Omega$ in this case, it is not an issue. For example, when policy evaluation is applied as a built-in module of actor-critic methods, the Q-value outside $\Omega$ will never appear in the policy gradient formula. Therefore, throughout this paper, we will assume $\Omega=\mathcal{S}\times\mathcal{A}$ for the ease of discussion.

Theorem 3.5.

Consider the over-parameterized linear GNTD under Assumptions 3.1 and 3.4. If we set $\beta\in(0,1)$ in the population update (3), then

$$\|Q(\theta^{K})-Q^{\pi}\|_{\mu}\leq(1-(1-\gamma)\beta)^{K}\|Q(\theta^{0})-Q^{\pi}\|_{\mu}.$$

Due to the absence of a positive definite feature covariance matrix in the over-parameterized regime, existing results for linear TD (Bhandari et al., 2018) require $\mathcal{O}\big{(}\frac{1}{(1-\gamma)^{2}\varepsilon^{2}}\big{)}$ iterations to guarantee $\|Q(\theta^{K})-Q^{\pi}\|_{\mu}\leq\mathcal{O}(\varepsilon)$ , while GNTD obtains linear convergence that only requires $\mathcal{O}(\frac{1}{1-\gamma}\log\frac{1}{\varepsilon})$ iterations to find $\varepsilon$ -accurate solutions.
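The linear rate in the over-parameterized regime can be observed numerically. The sketch below is a hedged illustration rather than the paper's procedure: the function name is hypothetical, `mu` is assumed to be strictly positive and stationary for `P_pi`, and because the subproblem has many minimizers when $d\geq|\mathcal{S}||\mathcal{A}|$ , the minimum-norm solution returned by `np.linalg.lstsq` is used as one possible choice of $d^{k}$ .

```python
import numpy as np

def overparam_gntd_errors(Phi, mu, r, P_pi, gamma, beta, num_iters):
    """Run population GNTD with d >= |S||A| features and record ||Q(theta^k) - Q^pi||_mu."""
    n = Phi.shape[0]
    q_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r)   # true Q^pi
    sqrt_mu = np.sqrt(mu)
    theta = np.zeros(Phi.shape[1])
    errors = []
    for _ in range(num_iters + 1):
        q = Phi @ theta
        errors.append(np.sqrt(np.sum(mu * (q - q_pi) ** 2)))
        residual = r + gamma * (P_pi @ q) - q             # T Q - Q
        # Minimum-norm minimizer of || Phi d - residual ||_mu^2.
        direction, *_ = np.linalg.lstsq(
            sqrt_mu[:, None] * Phi, sqrt_mu * residual, rcond=None
        )
        theta = theta + beta * direction
    return np.array(errors)
```

When the weighted Gram matrix is positive definite, the residual is fitted exactly, so each step performs $Q(\theta^{k+1})=(1-\beta)Q(\theta^{k})+\beta TQ(\theta^{k})$ and `errors[k]` should stay below $(1-(1-\gamma)\beta)^{k}$ times `errors[0]`.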

3.1.2 Neural Network Approximation

Compared to the over-parameterized linear approximation, a more natural way to parameterize the Q-function is the neural network. Consider a two-layer neural network:

$$Q(s,a;\theta):=\frac{1}{m}\sum_{r=1}^{m}b_{r}\psi(\theta_{r}^{\top}\phi(s,a)),$$

where $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{d}$ is the feature mapping, $\theta\!:=\!(\theta_{1}^{\top},\cdots,\theta_{m}^{\top})^{\!\top}\!\in\mathbb{R}^{m\times d},b:=(b_{1},\cdots,b_{m})^{\!\top}\in\mathbb{R}^{m}$ are the weight matrices, and $\psi(z):=\max\{0,z\}$ is the ReLU activation. Similar to linear function approximation, we assume $\|\phi(s,a)\|\leq 1$ for any $(s,a)$ . We denote by $\theta^{k}_{r}$ the value of $\theta_{r}$ in iteration $k$ . For each $r$ , we initialize the weights $\theta^{0}_{r}\sim\mathcal{N}(0,\nu^{2}I)$ and $b_{r}\sim\rm{Unif}\{-1,+1\}$ . The parameter $b$ will not be trained during the optimization. Following (Du et al., 2018), we make the following assumption to ensure the positive definiteness of the $\mu$ -weighted Gram matrix.

Assumption 3.6.

For all pairs $(s,a)\neq(s^{\prime},a^{\prime})$ , we assume $\phi(s,a)\nparallel\phi(s^{\prime},a^{\prime})$ . Moreover, we assume $\mu(s,a)>0$ to simplify the discussion.

Similar to Assumption 3.4, the requirement that the support $\Omega=\text{supp}(\mu)$ covers the whole $\mathcal{S}\times\mathcal{A}$ can be relaxed by only considering the MSBE on $\Omega$ . Other settings in Assumption 3.6 imply independence and boundedness between two feature vectors. The next theorem establishes the global convergence rate of neural GNTD when it follows population update.

Theorem 3.7.

Suppose Assumptions 3.1 and 3.6 hold. If we set $\beta\!\in\!(0,1)$ in the population update (3), and the network width $m=\Omega\left(\frac{|\mathcal{S}|^{3}|\mathcal{A}|^{3}}{\delta^{2}(1-\gamma)^{2}}\right)$ , $\delta\!\in\!(0,1)$ , then w.p. $1-\delta$ , we have

$$\|Q(\theta^{K})-Q^{\pi}\|_{\mu}\leq\Big(1-\frac{(1-\gamma)\beta}{2}\Big)^{K}\cdot\|Q(\theta^{0})-Q^{\pi}\|_{\mu}.$$

For over-parameterized neural networks, the feature covariance matrix will always be rank-deficient while the Gram matrix can be positive definite under proper initialization. Effectively exploiting this property significantly distinguishes the analysis of GNTD from the existing TD-type methods with neural approximation. As a result, to find an $\varepsilon$ -accurate Q-function approximator, GNTD needs $\mathcal{O}\left(\frac{1}{1-\gamma}\log\frac{1}{\varepsilon}\right)$ iterations, while $\mathcal{O}\left(\frac{1}{(1-\gamma)^{2}\varepsilon^{2}}\right)$ iterations are required by neural TD (Cai et al., 2019).

3.1.3 General Smooth Function Approximation

In this section, we consider the smooth function approximation that satisfies the following properties.

Assumption 3.8.

For all $(s,a)$ , the function $Q(s,a;\cdot)$ is uniformly bounded by $M$ , $L_{1}$ -Lipschitz, and $L_{2}$ -smooth:

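$$\|Q(s,a;\theta_{1})-Q(s,a;\theta_{2})\|\leq L_{1}\|\theta_{1}-\theta_{2}\|,\qquad\|\nabla Q(s,a;\theta_{1})-\nabla Q(s,a;\theta_{2})\|\leq L_{2}\|\theta_{1}-\theta_{2}\|,\quad\forall\theta_{1},\theta_{2}.$$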

Analogous to Assumption 3.4, we make the following requirement for the Jacobian matrix $J_{Q}(\theta)$ .

Assumption 3.9.

$\exists\lambda_{0}>0$ such that for any $\theta$ , we have $J_{Q}(\theta)^{\!\top}UJ_{Q}(\theta)\succeq\lambda_{0}I$ , where $U:=\text{diag}(\mu)$ is diagonal.

When $Q(\cdot)$ is linear, $J_{Q}(\theta)=\Phi$ and $J_{Q}(\theta)^{\!\top}UJ_{Q}(\theta)=\Sigma$ , so Assumption 3.9 reduces to Assumption 3.2. It also implies that the objective function $L(\theta^{k},d)$ (cf. (4)) of the $d^{k}$ subproblem of the population update (3) is $2\lambda_{0}$ -strongly convex. Let us define the worst-case optimal fitting error of $L(\theta,d)$ over the parameter space $\mathcal{F}$ of $\theta$ as

$$\varepsilon_{\mathcal{F}}^{2}=\max_{\theta\in\mathcal{F}}\min_{d}L(\theta,d)<\infty,$$

then we have the following theorem.

Theorem 3.10.

Suppose Assumptions 3.1, 3.8, and 3.9 hold. Then for population update (3), there exists a constant $M_{1}>0$ independent of $\beta$ and $K$ such that

[TABLE]

If we set $\beta=(1-\gamma)\varepsilon$ , then $\|Q(\theta^{K})-Q^{\pi}\|_{\mu}\leq\mathcal{O}(\varepsilon+\varepsilon_{\mathcal{F}})$ after $K=\mathcal{O}\Big{(}\frac{1}{(1-\gamma)^{2}\varepsilon}\log\frac{1}{\varepsilon}\Big{)}$ iterations.

3.2 Sample Complexity for Stochastic Update

In this section, we analyze the convergence and complexity results of the more practical stochastic GNTD method (7) under linear, neural network, and general smooth function approximations. In this scheme, each iteration is constructed by sampling a batch of $N$ data tuples from the distribution $\mathcal{D}$ .

3.2.1 Linear Approximation

We start with the under-parameterized linear approximation case, for which the convergence and complexity of the TD method are well studied under Assumption 3.2. In this situation, the main technique for establishing finite-sample convergence of stochastic GNTD is to exploit a concentration inequality and analyze the proposed method as the mean-path population update with controllable errors. In particular, to facilitate the application of the concentration inequality, we require the iteration sequence $\{\theta^{k}\}_{k=0}^{K}$ to be bounded. In fact, Assumption 3.2 indicates that

[TABLE]

Therefore, by exploiting the fast convergence of GNTD we can inductively provide a uniform bound for $\{\theta^{k}\}_{k=0}^{K}$ with high probability while simultaneously establishing the convergence of $\|Q(\theta^{k})-Q^{*}\|_{\mu}$ . As a result, we derive the following theorem for stochastic GNTD.

Theorem 3.11.

Suppose Assumptions 3.1 and 3.2 hold. For any $\varepsilon\ll\|Q^{0}-Q^{*}\|_{\mu}$ , if we set $\beta=\frac{(1-\gamma)\lambda_{0}}{18}$ , the damping rate $\omega=\frac{(1-\gamma)\lambda_{0}^{2}}{18}$ and the sample size $N=\mathcal{O}\left(\frac{1}{(1-\gamma)^{2}\varepsilon^{2}}\log\frac{K}{\delta}\right)$ for each iteration, then Algorithm 1 satisfies

[TABLE]

w.p. $1-\delta$ , where $\tau_{1}>0$ is a constant.

Suppose the Q-function is realizable. Then setting $K=\mathcal{O}(\frac{1}{(1-\gamma)^{2}}\log\frac{1}{\varepsilon})$ in Theorem 3.11 yields $\|Q(\theta^{K})-Q^{*}\|_{\mu}\leq\mathcal{O}(\varepsilon)$ . The resulting $\tilde{\mathcal{O}}(\frac{1}{\varepsilon^{2}})$ sample complexity matches the result of stochastic linear TD (Bhandari et al., 2018).

Next, we consider the over-parameterized case. In this scenario, we still need a uniform bound on the iteration sequence $\{\theta^{k}\}_{k=0}^{K}$ to facilitate the concentration inequality. However, this is not straightforward because (11), which relies on Assumption 3.2, is no longer available. Fortunately, on the one hand, Assumption 3.4 enables us to bound $\|\theta^{k+1}-\theta^{k}\|$ by $\|Q(\theta^{k})-TQ(\theta^{k})\|_{\mu}$ . On the other hand, fast convergence of $\|Q(\theta^{k})-Q^{\pi}\|_{\mu}$ in GNTD further implies fast convergence of $\|Q(\theta^{k})-TQ(\theta^{k})\|_{\mu}$ . Therefore, we can inductively provide a uniform bound for $\{\theta^{k}\}_{k=0}^{K}$ via the triangle inequality $\|\theta^{k}-\theta^{0}\|\leq\sum_{i=1}^{k}\|\theta^{i}-\theta^{i-1}\|$ while proving fast convergence of $\|Q(\theta^{k})-Q^{\pi}\|_{\mu}$ .

Theorem 3.12.

Consider the over-parameterized linear GNTD under Assumptions 3.1 and 3.4. For any $\varepsilon\ll\|Q^{0}-Q^{\pi}\|_{\mu}$ , if we set $\beta\in(0,1),\omega=\mathcal{O}\left((1-\gamma)^{2}\varepsilon\right)$ and the sample size $N=\mathcal{O}(\frac{1}{(1-\gamma)^{4}\varepsilon^{2}}\log\frac{K}{\delta})$ for each iteration, then the output of Algorithm 1 satisfies

[TABLE]

w.p. $1-\delta$ , where $\tau_{2}>0$ is a constant.

In the absence of Assumption 3.2, the existing analysis of linear TD (Bhandari et al., 2018) only provides a sub-optimal $\mathcal{O}\left(\frac{1}{\varepsilon^{4}}\right)$ sample complexity. In contrast, our GNTD method guarantees an $\mathcal{O}\left(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon}\right)$ sample complexity by utilizing the over-parameterization structure that allows Assumption 3.4, which yields a significant advantage over the TD method.

3.2.2 Neural Network Approximation

Now let us proceed to the discussion of GNTD on neural network approximation. By utilizing a similar approach to over-parameterized linear GNTD, one can prove a uniform bound on the iteration sequence when $Q(\theta)$ is parameterized by the neural network (9).

Theorem 3.13.

Suppose Assumptions 3.1 and 3.6 hold. If we set $\beta,\omega\!\in\!(0,1)$ and the network width $m=\Omega\left(\frac{|\mathcal{S}|^{3}|\mathcal{A}|^{3}}{\delta^{2}(1-\gamma)^{2}}\right)$ for each iteration, then the output of Algorithm 1 satisfies

[TABLE]

w.p. $1-\delta$ , where $\tau_{3}>0$ is some constant.

By setting $\omega=\mathcal{O}((1-\gamma)^{2}\varepsilon),N=\mathcal{O}(\frac{1}{(1-\gamma)^{4}\varepsilon^{2}})$ and $K=\mathcal{O}(\frac{1}{1-\gamma}\log\frac{1}{\varepsilon})$ , Theorem 3.13 indicates an $\mathcal{O}\left(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon}\right)$ sample complexity of neural GNTD, which is much better than the $\mathcal{O}\left(\frac{1}{\varepsilon^{4}}\right)$ sample complexity of neural TD (Cai et al., 2019).
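The total sample count behind this claim can be traced by multiplying the iteration count by the per-iteration batch size. The following is only a sketch of that arithmetic: it suppresses constants and any logarithmic dependence on the failure probability, which is not stated for this choice of $N$ .

```latex
\underbrace{K}_{\text{iterations}}\cdot\underbrace{N}_{\text{samples per iteration}}
=\mathcal{O}\Big(\frac{1}{1-\gamma}\log\frac{1}{\varepsilon}\Big)\cdot
 \mathcal{O}\Big(\frac{1}{(1-\gamma)^{4}\varepsilon^{2}}\Big)
=\mathcal{O}\Big(\frac{1}{(1-\gamma)^{5}\varepsilon^{2}}\log\frac{1}{\varepsilon}\Big).
```

Treating $1-\gamma$ as a constant recovers the $\mathcal{O}\left(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon}\right)$ rate quoted above.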

3.2.3 General Smooth Function Approximation

For smooth functions, the key is to bound the gap between the solutions of (4) and (6). There are many ways to deal with (6), including Empirical Risk Minimization (Shalev-Shwartz et al., 2009) and ProxBoost (Davis et al., 2021). In particular, ProxBoost provides the state-of-the-art sample complexity w.r.t. the failure probability $\delta$ and the problem condition number. Therefore, we adopt ProxBoost to enhance the solution to the ERM subproblem (6).

Theorem 3.14.

Suppose Assumptions 3.1, 3.8, and 3.9 hold. If we set $\beta,\omega\in(0,1)$ , then w.p. $1-\delta$ , the output of Algorithm 1 satisfies

[TABLE]

for some constants $M_{1},M_{2}$ .

To the best of our knowledge, there is no explicit finite-time convergence analysis of TD for smooth functions. By choosing the damping rate $\omega=\mathcal{O}((1-\gamma)^{2}\varepsilon^{2})$ and the step size $\beta=\mathcal{O}((1-\gamma)\varepsilon)$ , stochastic GNTD can output an $\varepsilon$ -accurate Q-function approximator with $\mathcal{O}\left(\frac{1}{(1-\gamma)^{2}\varepsilon}\log\frac{1}{\varepsilon}\right)$ iterations and in total $\mathcal{O}\left(\frac{1}{\varepsilon^{3}}\left(\log\frac{1}{\varepsilon}\right)^{2}\right)$ samples.

4 Experiments

Finally, we conduct a series of experiments on OpenAI Gym (Brockman et al., 2016) tasks and demonstrate the efficiency of the GNTD method under a variety of settings.

In detail, we first examine the advantage of GNTD over TD in the on-policy reinforcement learning setting, where policy evaluation serves as a built-in module of the policy iteration method. Second, we consider a few offline RL tasks, where we extend the proposed method to the Q-learning setting. All the compared learning algorithms are trained without exploration. We compare the performance of different algorithms in terms of the Bellman error and the final return.

4.1 Policy Optimization with GNTD Method

First, we present the experiments where GNTD and TD are executed as built-in modules of an entropy-regularized (Haarnoja et al., 2018) policy iteration method. Typically, policy iteration is divided into two steps: policy evaluation and policy improvement. Specifically, given an initial policy $\pi_{0}$ , our agents collect a data batch and then perform a 25-step policy evaluation to obtain $Q^{\pi_{0}}$ , by either the GNTD or the TD method. Then, we take a 1-step policy gradient (PG) ascent on the entropy-regularized total reward:

[TABLE]

where $H(\pi(\cdot\!\mid\!s))=\mathbb{E}_{a\sim\pi(\cdot\mid s)}[-\log\pi(a\!\mid\!s)]$ is the entropy of the density function $\pi(\cdot\!\mid\!s)$ . Then the agent executes the new policy to collect a new data batch, and the policy optimization loop repeats until convergence.
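To make the entropy term concrete, the following minimal sketch evaluates $H(\pi(\cdot\mid s))$ for a policy over finitely many actions. The discrete action set and the function name `policy_entropy` are illustrative assumptions; the experiments themselves use continuous-control policies with a density $\pi(\cdot\mid s)$ .

```python
import math

def policy_entropy(probs):
    """Entropy H(pi(.|s)) = E_{a~pi}[-log pi(a|s)] of a discrete action distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("action probabilities must sum to 1")
    # Actions with zero probability contribute nothing (limit of -p log p as p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical example: a uniform policy over 4 actions has entropy log(4),
# the largest value attainable with 4 actions.
uniform = [0.25, 0.25, 0.25, 0.25]
print(policy_entropy(uniform), math.log(4))
```

A more peaked policy yields a smaller entropy, so the regularizer in the objective rewards policies that keep some randomness in their action choice.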

For the policy function $\pi$ and state-action value function $Q$ , we employ two-layer neural networks. For computational efficiency, we implement the GNTD algorithm with the K-FAC method (see Appendix A). We set the damping rate $\omega=0.25$ and the learning rate $\beta=0.0003$ of the $Q$ -function.

Figure 1 shows the experimental results under several on-policy OpenAI Gym environments. PG-GNTD and PG-TD refer to the policy optimization based on GNTD and TD, respectively. It can be observed that PG-GNTD converges faster than PG-TD in the Hopper, Walker2d and Swimmer environments. It also attains higher final rewards in all tasks.

4.2 Offline Reinforcement Learning Tasks

In this section, we present the experiments for both discrete and continuous offline RL tasks, where we will focus on optimizing rather than evaluating the policy. We compare the performance of our method against several benchmarks in terms of the Bellman error and average return.

4.2.1 Discrete Action Tasks

In this experiment, we present the experimental results under the OpenAI Gym CartPole-v1 and Acrobot-v1 environments. The tested algorithms include TD, DQN (Mnih et al., 2013), GNTD, and GNDQN. In particular, both TD and GNTD are adapted to the Q-learning setting where the Bellman operator is replaced with the Bellman optimality operator. GNDQN is a variant of the GNTD method that incorporates a DQN-style momentum update to the target network, while taking Gauss-Newton steps to update the weight matrices. Furthermore, the four algorithms use the same neural network architecture and have the same learning rate of $\beta=0.0003$ . The size of all offline datasets is chosen as $100000$ , and we set the damping rate to be $\omega=0.25$ .

Compared to the on-policy setting, offline RL requires strong conditions on the data distribution in order to obtain an optimal or near-optimal policy. Following (Fan et al., 2020; Agarwal et al., 2020), we consider the following types of datasets:

• Replay datasets. Train an online policy until convergence and use all samples during training.

• Medium-replay datasets. Train an online sub-optimal policy and use all samples during training.

From Figure 2 and Table 3, it can be observed that GNTD outperforms TD in terms of convergence speed, final reward, and Bellman error. After incorporating the momentum into the target network update, our GNDQN also dominates DQN in all the reported performance measures.

4.2.2 Continuous Tasks

Finally, we examine the performance of GNTD on the OpenAI Gym MuJoCo tasks using D4RL datasets. In these tasks, we propose GNTD3+BC, a variant of our method that merges GNTD with TD3+BC (Fujimoto & Gu, 2021), where a behavior cloning term is added to regularize the policy. See Appendix E for details.

Table 2 shows the final numerical results for GNTD3+BC and other baselines, including BC (behavioral cloning), BCQ (Fujimoto et al., 2019), CQL (Kumar et al., 2020) and TD3+BC (Fujimoto & Gu, 2021). Compared with TD3+BC, GNTD3+BC has higher final returns and lower variance in multiple environmental settings.

Appendix A Kronecker-Factored Approximate Curvature (K-FAC) Method for GNTD

In this section, we introduce the Kronecker-Factored Approximate Curvature (K-FAC) method (Martens & Grosse, 2015), which provides an efficient implementation of neural GNTD.

Although the GNTD update formula (7) provided in Section 2.2 differs in expression from the natural gradient method (which requires the loss function to be the negative log-likelihood of a normal distribution), the two share a similar functional structure.

For a feed-forward deep neural network (DNN) with $L$ layers, we denote the weight matrix of the $l$ -th layer by $\theta_{l}$ ( $l=1,2,\cdots,L$ ) and the ReLU activation function by $\psi(\cdot)$ . For any state-action pair $(s,a)$ , the output $Q(s,a;\theta)$ is in general a non-convex function of the weights $\theta=\big{[}\theta_{1}^{\top},\ldots,\theta_{L}^{\top}\big{]}^{\top}$ . Alternatively, $\theta_{l}$ can also be viewed as a parameter matrix in $\mathbb{R}^{(n_{l-1}+1)\times n_{l}}$ that maps $n_{l-1}$ -dimensional vectors to ${n_{l}}$ -dimensional vectors. We define $\textbf{\rm Mat}(\cdot)$ as a matrix form of the vector parameters related to the number of neurons in a single layer and define $\textbf{\rm Vec}(\cdot)$ as a flattened vector form of the matrix parameters. The following algorithm describes the network’s forward and backward passes for a single state-action pair $(s,a)$ .

From Algorithm 2, with the weights of the neural network being $\theta^{k}$ , we let $p^{k}_{l}$ and $q^{k}_{l}$ denote the forward vector and backward vector of the $l$ -th layer, respectively, and we define the matrices

[TABLE]

For a training dataset that contains multiple data-points, the K-FAC method attempts to approximate the matrix $H(\theta^{k})$ in (5) by the following block-diagonal matrix

[TABLE]

After incorporating the identity matrix originating from the Levenberg-Marquardt method, we approximately calculate the matrix inverse as follows

[TABLE]

For the stochastic sampling case where the expectations are approximated by the sample averages, we let $\hat{P}^{k}_{l}$ and $\hat{Q}^{k}_{l}$ be the empirical estimators of $P^{k}_{l}$ and $Q^{k}_{l}$ , which are given as

[TABLE]

where $P^{k}_{l}(i)$ and $Q^{k}_{l}(i)$ are constructed by running Algorithm 2 for the $i$ -th data point and applying (12). Similarly, let $\hat{g}^{k}$ be an estimator of the semi-gradient $g(\theta^{k})$ , and let $\hat{g}^{k}_{l}$ be the semi-gradient of the $l$ -th layer. Then the descent direction for the $l$ -th layer is

[TABLE]

Then we can naturally get the expression for the parameter update:

[TABLE]
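The computational saving of a Kronecker-factored block rests on two standard identities, recalled here as a sketch. The symbols $A$ , $B$ (invertible square matrices) and $X$ (a conformable matrix) are generic placeholders rather than the paper's notation, and $\mathrm{vec}(\cdot)$ is assumed to stack columns.

```latex
(A\otimes B)^{-1}=A^{-1}\otimes B^{-1},
\qquad
(A\otimes B)\,\mathrm{vec}(X)=\mathrm{vec}\big(BXA^{\top}\big)
\;\Longrightarrow\;
(A\otimes B)^{-1}\,\mathrm{vec}(X)=\mathrm{vec}\big(B^{-1}XA^{-\top}\big).
```

Applying the inverse of one layer's block to that layer's semi-gradient therefore only requires inverting two factors whose sizes match the layer's input and output dimensions, instead of one matrix whose size is their product.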

Appendix B Extending GNTD to Q-learning Algorithms

As mentioned in Section 4.2.1, we extend the GNTD method to offline Q-learning algorithms, where we consider policy optimization instead of just policy evaluation. Specifically, with $\mathcal{D}_{k}=\{(s_{i},a_{i},r_{i},s^{\prime}_{i})\}_{i=1}^{N}$ being a batch of $N$ tuples collected from the distribution $(s_{i},a_{i})\sim\mu$ and $s_{i}^{\prime}\sim\mathbb{P}(\cdot|s_{i},a_{i})$ , we define the stochastic estimator of the semi-gradient as

[TABLE]

for any tuple $\xi=(s,a,r,s^{\prime})$ . The TD error term $\delta^{k}(\xi)$ in (16) is induced by the Bellman optimality operator instead of the Bellman operator:

[TABLE]

Here $\theta^{k}_{targ}=\theta^{k}$ in TD and $\theta^{k}_{targ}=(1-\tau)\theta^{k-1}_{targ}+\tau\theta^{k}$ in DQN (Mnih et al., 2013). Combining with the curvature matrix, we design the GNTD and GNDQN learning algorithms. See Algorithm 3 for more details.
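The two ingredients described here, a TD error built from the Bellman optimality operator and a momentum-style target update, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses a tabular Q-function stored as nested lists, adopts the sign convention $\delta=Q-TQ$ from Appendix C, and the function names are hypothetical.

```python
def td_error_optimality(q, q_targ, transition, gamma):
    """delta(xi) = Q(s,a) - (r + gamma * max_a' Q_targ(s',a')) for one tuple xi."""
    s, a, r, s_next = transition
    target = r + gamma * max(q_targ[s_next])  # Bellman optimality backup
    return q[s][a] - target

def soft_target_update(theta_targ, theta, tau):
    """DQN-style update: theta_targ <- (1 - tau) * theta_targ + tau * theta."""
    return [(1.0 - tau) * t_old + tau * t_new
            for t_old, t_new in zip(theta_targ, theta)]

# Hypothetical 2-state, 2-action table; q[s][a] is the current estimate.
q = [[0.0, 1.0], [2.0, 0.5]]
delta = td_error_optimality(q, q, (0, 1, 1.0, 1), gamma=0.9)
print(delta)

# tau = 1 copies the current parameters, which matches the TD choice
# theta_targ = theta; a small tau makes the target move slowly.
print(soft_target_update([0.0, 2.0], [1.0, 4.0], tau=1.0))
```

In this sketch the only difference between the TD-style and DQN-style variants is the value of `tau` passed to `soft_target_update`.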

Appendix C Analysis of Population Updates in Section 3.1

Throughout the discussion of population update, we will write $Q^{k}=Q(\theta^{k})$ and $\delta^{k}=Q^{k}-TQ^{k}$ in the $k$ -th iteration.

C.1 Proof of Theorem 3.3 (Under-parameterized Linear GNTD)

Recall that $U=\text{diag}(\mu)$ is a diagonal matrix. For linear parameterization where $Q(\theta)=\Phi\theta$ , the population GNTD update (3) has an explicit formula:

[TABLE]

where

[TABLE]

Let $Q^{*}=Q(\theta^{*})$ be the optimal linear approximator under the feature matrix $\Phi$ , then we introduce two important lemmas for the analysis of under-parameterized linear function approximation.

Lemma C.1.

(Bhandari et al., 2018) Under Assumption 3.1, we have that

[TABLE]

Lemma C.2.

(Cai et al., 2019) Under Assumption 3.1, we have that $\|Q^{*}-Q^{\pi}\|_{\mu}\leq\frac{1}{1-\gamma}\|\Pi_{\mu}Q^{\pi}-Q^{\pi}\|_{\mu}.$

Now we are ready to prove Theorem 3.3.

Proof.

Recall the update formula from (17), then we have

[TABLE]

By Assumption 3.2 and Lemma C.1, we have

[TABLE]

Choosing $\beta=\frac{(1-\gamma)\lambda_{0}}{4}$ yields

[TABLE]

Then by Lemma C.2, we have

[TABLE]

Thus we complete the proof. ∎

C.2 Proof of Theorem 3.5 (Over-parameterized Linear GNTD)

By Assumption 3.4, the $\mu$ -weighted Gram matrix $\Phi_{\mu}\Phi_{\mu}^{\top}$ is nonsingular, so the least-squares subproblem of the population update (3) has an explicit solution

[TABLE]

where $\delta^{k}=Q^{k}-TQ^{k}$ is the population TD error vector. Consequently, we have

[TABLE]

Therefore,

[TABLE]

This completes the proof.

C.3 Proof of Theorem 3.7 (Neural GNTD)

Let $J^{k}=J_{Q}(\theta^{k})$ be the Jacobian matrix. For the neural network function approximation (9), we can rewrite (3) as

[TABLE]

where we denote $J_{\mu}^{k}=U^{\frac{1}{2}}J^{k}$ . Let $x_{i}=\sqrt{\mu(s_{i},a_{i})}\phi(s_{i},a_{i}),\forall 1\leq i\leq|\mathcal{S}||\mathcal{A}|$ , where $i$ corresponds to the $i$ -th diagonal element of the diagonal matrix $U$ . To simplify the notation, let $n=|\mathcal{S}||\mathcal{A}|$ . Then we define the $(i,j)$ -th element of the $\mu$ -weighted expected Gram matrix $G^{\infty}\in\mathbb{R}^{n\times n}$ as follows:

[TABLE]

where $\mathbf{1}\{\cdot\}$ is the indicator function and the expectation is taken w.r.t. the Gaussian initialization of the weights. Additionally, we define the $\mu$ -weighted Gram matrix in the $k$ -th iteration as $G^{k}=J^{k}_{\mu}(J^{k}_{\mu})^{\top}$ . Let $c=\frac{\max_{i}\|x_{i}\|}{\min_{i}\|x_{i}\|}=\frac{\max_{(s,a)}\|\sqrt{\mu(s,a)}\phi(s,a)\|}{\min_{(s,a)}\|\sqrt{\mu(s,a)}\phi(s,a)\|}.$ The analysis of (Du et al., 2018) suggests that $G^{k}\succ 0$ can be ensured by setting the network width $m$ to be an appropriate polynomial of $n$ , $c$ , and $\lambda_{\min}(G^{\infty})^{-1}$ . However, the constant $c$ is only related to the network width $m$ , and does not affect the convergence rate of GNTD. Therefore, to simplify the discussion, we omit the constant $c$ in subsequent proofs.

Lemma C.3.

(Du et al., 2018) Suppose Assumption 3.6 holds, then $G^{\infty}\succ 0$ . Define $\lambda_{0}:=\lambda_{\min}(G^{\infty})>0$ . If the network width $m=\Omega\left(\frac{n\log(n/\delta)}{\lambda_{0}}\right)$ , then we have $\lambda_{\min}(G^{0})\geq$ $\frac{3}{4}\lambda_{0}$ w.p. $1-\delta$ .

Lemma C.4.

(Zhang et al., 2019) Suppose Assumption 3.6 holds. For any $\theta$ , denote $J_{\mu}:=U^{\frac{1}{2}}J_{Q}(\theta)$ , then w.p. at least $1-\delta$ , we have $\|J_{\mu}-J_{\mu}^{0}\|_{2}^{2}\leq\frac{2nR^{2/3}}{\nu^{2/3}\delta^{2/3}m^{1/3}}$ for all $\theta$ satisfying $\|\theta-\theta^{0}\|_{2}\leq R$ .

The above lemma shows that as long as the parameter $\theta$ is close to the random initialization, the corresponding $\mu$ -weighted Jacobian matrix is also close to the initial $\mu$ -weighted Jacobian matrix $J_{\mu}^{0}$ . Thus we expect $G^{k}\succ 0$ as long as the iterate $\theta^{k}$ stays close enough to $\theta^{0}$ . As a result, we have the following lemma.

Lemma C.5.

Suppose Assumptions 3.1 and 3.6 hold. If the network width satisfies $m=\Omega\Big{(}\frac{n^{3}}{\nu^{2}\lambda_{0}^{4}\delta^{2}(1-\gamma)^{2}}\Big{)}$ and $\theta$ satisfies $\|\theta-\theta^{0}\|_{2}\leq\frac{6\|Q^{0}-TQ^{0}\|_{\mu}}{(1-\gamma)\sqrt{\lambda_{\min}(G^{0})}}$ , then $w.p.\ 1-\delta$ , we have $\|J_{\mu}-J_{\mu}^{0}\|_{2}\leq\frac{C}{3}\sqrt{\lambda_{\min}(G^{0})}$ with $C=\mathcal{O}(m^{-1/6})$ being a constant, and $\lambda_{\min}(G)\geq\frac{4}{9}\lambda_{\min}(G^{0})$ with $G:=J_{\mu}J_{\mu}^{\top}$ being the $\mu$ -weighted Gram matrix at $\theta$ .

Proof.

First, by setting $R=\frac{6\|Q^{0}-TQ^{0}\|_{\mu}}{(1-\gamma)\sqrt{\lambda_{\min}(G^{0})}}$ in Lemma C.4, then w.p. $1-\delta$ we have

[TABLE]

According to the initialization $\theta^{0}$ and $b$ , we let $\mathbb{E}_{\theta^{0},b}$ denote the expectation w.r.t. $\theta_{r}^{0}\sim\mathcal{N}(0,\nu^{2}I)$ and $b_{r}\sim\rm{Unif}\{-1,+1\}$ for each $r\in[m]$ . Then under Assumption 3.1, we have

[TABLE]

where (i) follows from the fact that $(s,a)$ and $(s^{\prime},a^{\prime})$ have the same marginal distribution, (ii) follows from the independence among the $b_{r}$ ’s and the fact that $\mathbb{E}[b_{r}]=0,\forall r$ , and (iii) follows from the expectation of $\|\theta_{r}^{0}\|^{2}$ , where $d$ is the dimension of the feature mapping. Thus $\frac{6\|Q^{0}-TQ^{0}\|_{\mu}}{(1-\gamma)\sqrt{\lambda_{\min}(G^{0})}}=\mathcal{O}\Big{(}\frac{\sqrt{1/\delta}}{1-\gamma}\Big{)}$ w.p. $1-\delta$ by Markov's inequality. Let $m=\Omega\Big{(}\frac{n^{3}}{\nu^{2}\lambda_{0}^{4}\delta^{2}(1-\gamma)^{2}}\Big{)}$ , then $\|J_{\mu}-J_{\mu}^{0}\|_{2}\leq\frac{C}{3}\sqrt{\lambda_{\min}(G^{0})}$ , where $C=\mathcal{O}(m^{-1/6})$ is a constant.

Next, based on the inequality that $\sigma_{\min}(A+B)\geq\sigma_{\min}(A)-\sigma_{\max}(B)$ where $\sigma$ denotes singular value, we have

[TABLE]

where the last inequality uses the fact that $C\leq 1/3$ when the network width $m$ is large enough. ∎
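The perturbation inequality $\sigma_{\min}(A+B)\geq\sigma_{\min}(A)-\sigma_{\max}(B)$ used in this proof can be sanity-checked numerically. The sketch below does so for $2\times 2$ matrices only, using the closed-form eigenvalues of $M^{\top}M$ ; the matrices and helper names are hypothetical, with $A$ playing the role of $J_{\mu}^{0}$ and $B$ the role of $J_{\mu}-J_{\mu}^{0}$ .

```python
import math

def singular_values_2x2(m):
    """Return (sigma_min, sigma_max) of a 2x2 matrix via the eigenvalues of M^T M."""
    (a, b), (c, d) = m
    trace = a * a + b * b + c * c + d * d   # trace of M^T M
    det = (a * d - b * c) ** 2              # determinant of M^T M
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    return math.sqrt(max((trace - disc) / 2.0, 0.0)), math.sqrt((trace + disc) / 2.0)

def min_singular_bound_holds(a, b, tol=1e-12):
    """Check sigma_min(A + B) >= sigma_min(A) - sigma_max(B) for 2x2 matrices."""
    total = [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]
    lhs = singular_values_2x2(total)[0]
    rhs = singular_values_2x2(a)[0] - singular_values_2x2(b)[1]
    return lhs >= rhs - tol

base = [[3.0, 0.0], [0.0, 2.0]]            # stands in for the initial Jacobian
perturbation = [[0.1, -0.2], [0.3, 0.1]]   # stands in for the drift from it
print(min_singular_bound_holds(base, perturbation))
```

The smaller the spectral norm of the perturbation, the closer the smallest singular value stays to its initial value, which is how a small Jacobian drift keeps the Gram matrix positive definite.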

Lemma C.6.

Conditioning on the success of the high probability event in Lemma C.5, if $\|\theta^{k}-\theta^{0}\|_{2}\leq\frac{6\|Q^{0}-TQ^{0}\|_{\mu}}{(1-\gamma)\sqrt{\lambda_{\min}(G^{0})}}$ , then the following inequalities hold

[TABLE]

and

[TABLE]

where $C=\mathcal{O}(m^{-1/6})$ is the constant defined in Lemma C.5.

Proof.

We only need to verify the first inequality. Let $\theta(s):=s\theta^{k+1}+(1-s)\theta^{k}$ and calculate

[TABLE]

Next we estimate the bound on the norm of the second term $①$ . Note that

[TABLE]

where (i) is due to $\|J_{\mu}-J_{\mu}^{0}\|_{2}\leq\frac{C}{3}\sqrt{\lambda_{\min}(G^{0})}$ and (ii) is due to $\lambda_{\min}(G)\geq\frac{4}{9}\lambda_{\min}(G^{0})$ . Thus we complete the proof. ∎

Now we are ready to provide the proof of Theorem 3.7.

Proof.

Note that the key results of Lemmas C.5 and C.6 both rely on the condition that the analyzed point $\theta$ stays close to $\theta^{0}$ . Therefore, to prove this theorem with the above lemmas, we need to prove the following claim by induction:

[TABLE]

Then the final convergence rate result will be automatically covered as a byproduct in the proof of (20). When $k=0$ , (20) is obviously true. Now, suppose (20) holds for all $k=0,1,\cdots,t$ , we prove this argument for $k=t+1$ .

First, let us denote $T_{\beta}Q:=(1-\beta)Q+\beta TQ$ . Conditioning on the success of the high-probability event in Lemma C.5, Lemma C.6 and (20) indicate that (18) and (19) hold for $k=0,1,\cdots,t$ . Consequently, for any $0\leq k\leq t$ , we have

[TABLE]

where $C=\mathcal{O}(m^{-1/6})$ . Let $m$ be large enough such that $C\leq\frac{1-\gamma}{2(1+\gamma)}$ , then

[TABLE]

As a result, for $k=0,1,\cdots,t$ , we have

[TABLE]

where (i) is due to Lemma C.6. Thus we have that $\|TQ^{k+1}-Q^{k+1}\|_{\mu}\leq\left(1-\frac{(1-\gamma)\beta}{2}\right)^{k+1}\|TQ^{0}-Q^{0}\|_{\mu}$ for all $k\leq t$ .

Consequently, for $k=t+1$ , we have

[TABLE]

where (i) is due to Lemma C.5. Hence we complete the proof of (20). As a byproduct, we have (22) for all iterations, which further implies the convergence rate result: $\|Q^{K}-Q^{\pi}\|_{\mu}\leq\left(1-\frac{(1-\gamma)\beta}{2}\right)^{K}\|Q^{0}-Q^{\pi}\|_{\mu}.$ ∎

Remark C.7.

Note that (Zhang et al., 2019) theoretically verifies the efficient performance of the natural gradient method (or Gauss-Newton method) in deep learning. We find that this technique also works well on the semi-gradient method of policy evaluation. Unlike classification or regression problems, neural GNTD retains the structure of the FQI well, exploiting the contraction property of the Bellman operator $T$ to obtain global convergence straightforwardly.

C.4 Proof of Theorem 3.10 (Nonlinear Smooth GNTD)

Proof.

Under Assumptions 3.8 and 3.9, the $d^{k}$ subproblem in the population update (3) has a closed-form solution:

[TABLE]

where $J^{k}=J_{Q}(\theta^{k})$ is the Jacobian matrix and $\delta^{k}=Q^{k}-TQ^{k}$ is the population TD error. Consequently,

[TABLE]

Define the residual term $f_{k}(s,a)\!:=\!Q(s,a;\theta^{k+1})\!-\!Q(s,a;\theta^{k})\!-\!\left\langle\nabla Q(s,a;\theta^{k}),\theta^{k+1}\!-\!\theta^{k}\right\rangle$ . By Assumption 3.8, we have

[TABLE]

Recall the notation $T_{\beta}Q:=(1-\beta)Q+\beta TQ$ , with $\varepsilon_{\mathcal{F}}$ defined by (10), we have

[TABLE]

As a result, we have

[TABLE]

where (i) follows from the facts that $T_{\beta}$ is a contraction operator, that $T_{\beta}Q^{\pi}=Q^{\pi}$ , and that $\|Q^{k+1}-T_{\beta}Q^{k}\|_{\mu}$ is bounded as shown above. This completes the proof. ∎
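The contraction property of $T_{\beta}Q=(1-\beta)Q+\beta TQ$ invoked in step (i) can be illustrated on a tiny example. The sketch below is an assumption-laden toy: it uses a hypothetical 3-entry Markov reward process and measures distances in the sup norm, for which $T$ is a $\gamma$ -contraction and hence $T_{\beta}$ contracts with modulus $1-(1-\gamma)\beta$ , whereas the paper works in the $\mu$ -weighted norm.

```python
def bellman(q, rewards, p_pi, gamma):
    """(TQ)_i = r_i + gamma * sum_j P^pi[i][j] * Q_j over state-action indices."""
    n = len(q)
    return [rewards[i] + gamma * sum(p_pi[i][j] * q[j] for j in range(n))
            for i in range(n)]

def damped_bellman(q, rewards, p_pi, gamma, beta):
    """T_beta Q = (1 - beta) * Q + beta * TQ."""
    tq = bellman(q, rewards, p_pi, gamma)
    return [(1.0 - beta) * qi + beta * ti for qi, ti in zip(q, tq)]

def sup_norm(x, y):
    """Largest absolute entrywise difference between two vectors."""
    return max(abs(a - b) for a, b in zip(x, y))

gamma, beta = 0.9, 0.5
rewards = [1.0, 0.0, 2.0]
p_pi = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]]  # rows sum to 1
q1, q2 = [0.0, 5.0, -3.0], [4.0, -1.0, 2.0]

lhs = sup_norm(damped_bellman(q1, rewards, p_pi, gamma, beta),
               damped_bellman(q2, rewards, p_pi, gamma, beta))
rhs = (1.0 - (1.0 - gamma) * beta) * sup_norm(q1, q2)
print(lhs <= rhs)
```

Because $T_{\beta}$ and $T$ share the same fixed point, repeatedly applying `damped_bellman` from any starting vector converges to the solution of $Q=TQ$ .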

Appendix D Analysis of Stochastic Updates in Section 3.2

Before starting the analysis of stochastic GNTD, we introduce some notation. In the $k$ -th iteration, we obtain a batch of data tuples of the form $\xi=(s,a,r,s^{\prime},a^{\prime})$ , which we denote by $\mathcal{D}_{k}$ . For each $\xi\in\mathcal{D}_{k}$ , we define $\hat{g}(\theta^{k},\xi)=\delta^{k}(\xi)\cdot\nabla Q(s,a;\theta^{k})$ . Consequently, the semi-gradient estimator $\hat{g}^{k}$ defined in (7) can also be written as $\hat{g}^{k}=\frac{1}{|\mathcal{D}_{k}|}\sum_{\xi\in\mathcal{D}_{k}}\hat{g}(\theta^{k},\xi)$ . Let $\hat{\mu}^{k}$ be the empirical estimator of $\mu$ based on the dataset $\mathcal{D}_{k}$ . Then the stochastic estimator $\hat{H}^{k}$ defined in (7) can be equivalently written as $\hat{H}^{k}=\hat{H}(\theta^{k})=(J^{k})^{\top}\hat{U}^{k}J^{k}$ , where $\hat{U}^{k}=\text{diag}(\hat{\mu}^{k})$ .
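For a linear $Q(s,a;\theta)=\phi(s,a)^{\top}\theta$ , these batch estimators reduce to simple averages, as the following sketch shows. The feature table, batch, and function names are hypothetical, and the TD error is assumed to take the form $\delta(\xi)=Q(s,a;\theta)-r-\gamma Q(s^{\prime},a^{\prime};\theta)$ with the same $\theta$ in both terms.

```python
def q_value(theta, feat):
    """Linear Q(s,a;theta) = phi(s,a)^T theta."""
    return sum(f * t for f, t in zip(feat, theta))

def td_error(theta, phi, xi, gamma):
    """delta(xi) = Q(s,a;theta) - r - gamma * Q(s',a';theta)."""
    s, a, r, s_next, a_next = xi
    return q_value(theta, phi[(s, a)]) - r - gamma * q_value(theta, phi[(s_next, a_next)])

def semi_gradient(theta, phi, batch, gamma):
    """g_hat = (1/|D|) * sum over xi of delta(xi) * grad Q(s,a;theta)."""
    g = [0.0] * len(theta)
    for xi in batch:
        delta = td_error(theta, phi, xi, gamma)
        feat = phi[(xi[0], xi[1])]  # the gradient of a linear Q is its feature vector
        g = [gj + delta * fj / len(batch) for gj, fj in zip(g, feat)]
    return g

def gram_estimate(phi, batch):
    """H_hat = (1/|D|) * sum over xi of phi(s,a) phi(s,a)^T, i.e. J^T U_hat J."""
    feats = [phi[(xi[0], xi[1])] for xi in batch]
    d = len(feats[0])
    return [[sum(f[i] * f[j] for f in feats) / len(batch) for j in range(d)]
            for i in range(d)]

phi = {(0, 0): [1.0, 0.0], (1, 0): [0.0, 1.0]}   # hypothetical features
theta = [1.0, 2.0]
batch = [(0, 0, 1.0, 1, 0), (1, 0, 0.0, 0, 0)]   # tuples (s, a, r, s', a')
print(semi_gradient(theta, phi, batch, gamma=0.5))
print(gram_estimate(phi, batch))
```

Averaging outer products over the batch is the same as weighting each row of the Jacobian by its empirical visitation frequency, which is the content of the identity $\hat{H}^{k}=(J^{k})^{\top}\hat{U}^{k}J^{k}$ .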

To analyze the convergence of stochastic GNTD, as mentioned in Section 3.2, it is necessary to ensure that $\|\theta^{k}\|$ is bounded for each $k$ . This is mainly because the semi-gradient estimator $\hat{g}(\theta^{k},\xi)$ is controlled by $\theta^{k}$ . Define

[TABLE]

Later on, we discuss high-probability uniform upper bounds on $\sigma_{k},\forall k$ under different settings, including under-parameterized linear functions, over-parameterized linear functions and neural network functions. Let $\mathcal{F}^{k}$ be the sigma-algebra generated by the randomness up to the iterate $\theta^{k}$ . Then we have $\text{Var}_{\xi}(\hat{g}(\theta^{k},\xi)\mid\mathcal{F}^{k})\leq\sigma_{k}^{2}.$ Based on such upper bounds, we introduce the following two lemmas.

Lemma D.1.

Let $\left\{S_{i}\right\}_{i=1}^{N}$ be a sequence of independent random vectors. Assume $\mathbb{E}S_{i}=\mathbf{0}$ and $\left\|S_{i}\right\|\leq\sigma,\,\forall i$ , then

[TABLE]

The proof of this result is straightforward. By $\|S_{i}\|\leq\sigma$ , we have $\mathbb{E}[\exp\{\|S_{i}\|^{2}/\sigma^{2}\}]\leq\exp\{1\}.$ Then applying Lemma 4.1 (Lan, 2020) proves this lemma.

Lemma D.2.

Suppose Assumption 3.8 holds true, then for any $\eta\in(0,1)$ and any iteration $\theta^{k}$ , we have $\|\hat{H}^{k}-H(\theta^{k})\|_{2}\leq\eta$ w.p. $1-\delta$ as long as the sample size $|\mathcal{D}_{k}|\geq\frac{6L_{1}^{4}}{\eta^{2}}\log\frac{2d}{\delta}$ .

Proof.

First, recall the definition that

[TABLE]

Define $S(\xi)=\dfrac{1}{|\mathcal{D}_{k}|}\left(\nabla Q(s,a;\theta)\nabla Q(s,a;\theta)^{\top}-J(\theta)^{\top}UJ(\theta)\right)$ . Then it holds that $\mathbb{E}[S(\xi)]=0,\|S(\xi)\|_{2}\leq\frac{2L_{1}^{2}}{|\mathcal{D}_{k}|}$ and $\|\mathbb{E}[S(\xi)S(\xi)^{\top}]\|\leq\frac{2L_{1}^{4}}{|\mathcal{D}_{k}|^{2}}.$ By the matrix Bernstein inequality (Tropp et al., 2015), we have

[TABLE]

Choosing the batch size to satisfy $2d\cdot\exp\left\{\frac{-3|\mathcal{D}_{k}|\eta^{2}}{4L_{1}^{4}(3+\eta)}\right\}\leq\delta$ proves this lemma. ∎
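The last step can be made explicit by solving the tail condition for $|\mathcal{D}_{k}|$ . The sketch below computes both the smallest batch size satisfying that condition and the simplified sufficient size stated in Lemma D.2; the function names and the numerical inputs are illustrative assumptions. Since $\eta<1$ gives $\frac{4(3+\eta)}{3}<\frac{16}{3}<6$ , the simplified size is always at least the exact one.

```python
import math

def bernstein_batch_size(l1, eta, d, delta):
    """Smallest integer N with 2d * exp(-3 N eta^2 / (4 L1^4 (3 + eta))) <= delta."""
    exact = 4.0 * l1 ** 4 * (3.0 + eta) / (3.0 * eta ** 2) * math.log(2.0 * d / delta)
    return math.ceil(exact)

def lemma_batch_size(l1, eta, d, delta):
    """Simplified sufficient size 6 L1^4 / eta^2 * log(2d / delta) from Lemma D.2."""
    return math.ceil(6.0 * l1 ** 4 / eta ** 2 * math.log(2.0 * d / delta))

def bernstein_tail(n, l1, eta, d):
    """Right-hand side of the tail bound for a batch of size n."""
    return 2.0 * d * math.exp(-3.0 * n * eta ** 2 / (4.0 * l1 ** 4 * (3.0 + eta)))

# Hypothetical inputs: Lipschitz constant 1, accuracy 0.1, dimension 10, failure prob. 0.01.
l1, eta, d, delta = 1.0, 0.1, 10, 0.01
n_exact = bernstein_batch_size(l1, eta, d, delta)
n_lemma = lemma_batch_size(l1, eta, d, delta)
print(n_exact, n_lemma, bernstein_tail(n_lemma, l1, eta, d) <= delta)
```

The quadratic dependence on $1/\eta$ is what later drives the $\frac{1}{\omega^{2}}$ term in the batch-size requirements, once $\eta$ is tied to the damping rate.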

D.1 Proof of Theorem 3.11 (Stochastic Under-parameterized Linear GNTD)

First of all, let us provide a few supporting lemmas. For linear approximation, the stochastic GNTD update (7) can be written as $\theta^{k+1}=\theta^{k}+\beta\hat{d}^{k}$ with

[TABLE]

The following lemma provides a one-step progress for stochastic under-parameterized linear GNTD.

Lemma D.3.

Consider the linear parameterization $Q(s,a;\theta)=\phi(s,a)^{\top}\theta$ . For any $\delta_{0}>0$ , set $\beta=\frac{(1-\gamma)\lambda_{0}}{18}$ , $\omega=\frac{(1-\gamma)\lambda_{0}^{2}}{18}$ and the sample size $|\mathcal{D}_{k}|\geq\frac{6}{\omega^{2}}\log\frac{2d}{\delta_{0}}$ in iteration $k$ . Then, under Assumptions 3.1 and 3.2,

[TABLE]

Proof.

According to the stochastic update of GNTD with linear approximation, we have

[TABLE]

By Assumption 3.2 and Lemma D.2, if the sample size $|\mathcal{D}_{k}|\geq\frac{6}{\eta^{2}}\log\frac{2d}{\delta_{0}}$ for any $\eta\in(0,1)$ , we have

[TABLE]

Let $\eta=\omega$ . Then we have $w.p.\ 1-\delta_{0}$ that

[TABLE]

and

[TABLE]

where (i) and (ii) are both due to (25) and Assumption 3.2. Next by Lemma D.1, we have

[TABLE]

Thus the second term on the right side of equation (24) can be estimated as

[TABLE]

$w.p.\ 1-2\delta_{0}$ , where (i) follows (D.1) and (ii) follows from Lemma C.1. The last term on the right side of equation (24) can be decomposed as follows

[TABLE]

Observe that the first term on the right side of equation (29) does not exceed $-(1-\gamma)\|Q^{k}-Q^{*}\|_{\mu}^{2}$ by Lemma C.1. The second term on the right side of equation (29) can be estimated as

[TABLE]

The last term on the right side of equation (29) can be estimated as

[TABLE]

where (i) and (iii) both follow the inequality that $xy\leq\frac{x^{2}}{4}+y^{2}$ , and (ii) follows (D.1). Choosing $\omega=\eta=\frac{(1-\gamma)\lambda_{0}^{2}}{18}$ yields

[TABLE]

$w.p.\ 1-2\delta_{0}$ . Plugging (D.1), (D.1) into (24) yields that

[TABLE]

∎

The above lemma shows that the one-step error depends strongly on the sample size $|\mathcal{D}_{k}|$ . As long as $|\mathcal{D}_{k}|$ is sufficiently large, $\theta^{k}$ and $\sigma_{k}$ are both uniformly bounded for each $k$ . See Lemma D.4.

Lemma D.4.

Suppose Assumptions 3.1 and 3.2 hold. We define $\tau_{1}^{2}=9(1+\gamma^{2})\|\theta^{*}\|^{2}+9R_{\max}^{2}+\frac{12(1+\gamma)^{2}\|Q^{0}-Q^{*}\|_{\mu}^{2}}{\lambda_{0}}.$ Set $\varepsilon\ll\|Q^{0}-Q^{*}\|_{\mu},\beta=\frac{(1-\gamma)\lambda_{0}}{18},\omega=\frac{(1-\gamma)\lambda_{0}^{2}}{18}$ and the sample size $|\mathcal{D}_{k}|\geq\max\{\frac{36\tau_{1}^{2}}{(1-\gamma)^{2}\varepsilon^{2}},\frac{6}{\omega^{2}}\}\log\frac{2Kd}{\delta}$ for each iteration. Then w.p. $1-\delta$ , we have

[TABLE]

for any $\xi\sim\mathcal{D}$ and for all $k=0,1,\cdots,K$ .

Proof.

We will prove this lemma by induction. First of all, for any $\theta$ and any tuple $\xi$ , we have

[TABLE]

where (i) and (ii) are due to the fact that $\|x+y+z\|^{2}\leq 3(\|x\|^{2}+\|y\|^{2}+\|z\|^{2})$ , and (iii) is due to $\|\theta-\theta^{*}\|\leq\frac{1}{\sqrt{\lambda_{0}}}\|Q(\theta)-Q^{*}\|_{\mu}$ . Substituting $\theta=\theta^{0}$ into (D.1) proves (31) for $k=0$ .

Now, suppose (31) holds for $k=0,1,\cdots,t$ , then we prove this argument for $k=t+1$ . For $k=0,1,\cdots,t$ , we choose $\delta_{0}=\frac{\delta}{2K}$ in Lemma D.3 and the batch size $|\mathcal{D}_{k}|\geq\max\left\{\frac{36\tau_{1}^{2}}{(1-\gamma)^{2}\varepsilon^{2}},\frac{6}{\omega^{2}}\right\}\log\frac{2Kd}{\delta}$ , then we have $w.p.\ 1-2t\delta_{0}$ that

[TABLE]

for $k\leq t$ . Because $\varepsilon\!\ll\!\|Q^{0}\!-\!Q^{*}\|_{\mu}$ , conditioning on the success of the above high probability event, for $k=t+1$ , we have

[TABLE]

Substituting the above inequality into (D.1) yields

[TABLE]

Hence we have $\sigma_{k}^{2}\leq\tau_{1}^{2}$ and we have proved (31) for $k=t+1$ . By induction, (31) holds for $k\leq K$ w.p. $1-2K\delta_{0}=1-\delta$ . ∎

Now we are ready to provide the proof of Theorem 3.11. Recall the definition of $\tau_{1}^{2}$ in Lemma D.4. We restate Theorem 3.11 as follows to include the discussion of the specific parameters.

Theorem D.5.

Suppose Assumptions 3.1 and 3.2 hold and suppose the target accuracy level $\varepsilon\ll\|Q^{0}-Q^{*}\|_{\mu}$ . If we choose $\beta=\frac{(1-\gamma)\lambda_{0}}{18},\omega=\frac{(1-\gamma)\lambda_{0}^{2}}{18}$ and the sample size $|\mathcal{D}_{k}|=N\geq\max\left\{\frac{36\tau_{1}^{2}}{(1-\gamma)^{2}\varepsilon^{2}},\frac{6}{\omega^{2}}\right\}\log\frac{2Kd}{\delta}$ for each iteration, where $d$ is the dimension of the parameter $\theta$ , then w.p. $1-\delta$ , the output of Algorithm 1 satisfies

[TABLE]

Consequently, we have $\|Q^{K}-Q^{\pi}\|_{\mu}\leq\mathcal{O}\big{(}\varepsilon+\|\Pi_{\mu}Q^{\pi}-Q^{\pi}\|_{\mu}\big{)}$ with $\mathcal{O}\left(\frac{1}{(1-\gamma)^{2}}\log\frac{1}{\varepsilon}\right)$ iterations and $\mathcal{O}\left(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon}\right)$ samples in total.

Proof.

First, by Lemma D.4, we have $\sigma_{k}\leq\tau_{1}$ for $k\leq K$ w.p. $1-\delta$ . Then Lemma D.3 indicates that

[TABLE]

$w.p.\ 1-\delta$ . Then by Lemma C.2, we complete the proof. ∎

D.2 Proof of Theorem 3.12 (Stochastic Over-parameterized Linear GNTD)

By Lemma D.2, for any $\delta_{0},\eta\in(0,1)$ , when the batch size $|\mathcal{D}_{k}|\geq\frac{6}{\eta^{2}}\log\frac{2d}{\delta_{0}}$ , we have $\|\Phi^{\top}U\Phi-\Phi^{\top}\hat{U}^{k}\Phi\|\leq\eta$ $w.p.\ 1-\delta_{0}$ . Let us write $\Delta_{k}:=\Phi^{\top}\hat{U}^{k}\Phi+\omega I-\Phi^{\top}U\Phi$ , so that $(\hat{H}^{k}+\omega I)^{-1}=(\Phi_{\mu}^{\top}\Phi_{\mu}+\Delta_{k})^{-1}$ . Since $\Phi_{\mu}^{\top}\Phi_{\mu}$ is no longer positive definite in the over-parameterized setting, we handle the term $(\hat{H}^{k}+\omega I)^{-1}$ via the Sherman-Morrison-Woodbury (SMW) formula, using the fact that $\Delta_{k}$ is positive definite with high probability when $\omega>\eta$ . We also establish a uniform upper bound on $\sigma_{k}$ in Lemma D.7.
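As a numerical sanity check of the SMW step, the sketch below compares a direct inverse of $\Phi_{\mu}^{\top}\Phi_{\mu}+\Delta_{k}$ with the Woodbury form $\Delta_{k}^{-1}-\Delta_{k}^{-1}\Phi_{\mu}^{\top}(I+\Phi_{\mu}\Delta_{k}^{-1}\Phi_{\mu}^{\top})^{-1}\Phi_{\mu}\Delta_{k}^{-1}$ . The matrices are random stand-ins (the sizes, the seed, and the particular positive definite `Delta` are assumptions made only for illustration), with fewer rows than columns to mimic the over-parameterized case.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 10                                  # fewer rows than parameters
Phi_mu = rng.standard_normal((n, d))          # stand-in for the weighted feature matrix

# Symmetric positive definite stand-in for Delta_k (hypothetical values).
B = rng.standard_normal((d, d))
Delta = 0.1 * np.eye(d) + 0.01 * (B @ B.T)


def smw_inverse(Phi_mu, Delta):
    """Inverse of Phi_mu^T Phi_mu + Delta via the Sherman-Morrison-Woodbury formula."""
    Delta_inv = np.linalg.inv(Delta)
    # Only an n x n system has to be solved besides inverting Delta.
    small = np.eye(Phi_mu.shape[0]) + Phi_mu @ Delta_inv @ Phi_mu.T
    correction = Delta_inv @ Phi_mu.T @ np.linalg.solve(small, Phi_mu @ Delta_inv)
    return Delta_inv - correction


direct = np.linalg.inv(Phi_mu.T @ Phi_mu + Delta)
via_smw = smw_inverse(Phi_mu, Delta)
max_abs_diff = np.max(np.abs(direct - via_smw))
```

The two inverses agree up to floating-point error even though $\Phi_{\mu}^{\top}\Phi_{\mu}$ alone is rank deficient, which is why the invertibility of $\Delta_{k}$ is the property that matters here.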

Lemma D.6.

For any $\delta_{0}>0$ , we set $\beta\in(0,1)$ , $\omega\in(0,2)$ and the sample size $|\mathcal{D}_{k}|\geq\frac{24}{\omega^{2}}\log\frac{2d}{\delta_{0}}$ for $k$ -th iteration. Under Assumptions 3.1 and 3.4, we have w.p. $1-2\delta_{0}$ that

[TABLE]

Proof.

Recall that for linear approximation, $\theta^{k+1}=\theta^{k}-\beta(\hat{H}^{k}+\omega I)^{-1}\hat{g}^{k}$ with $\hat{H}^{k}=\Phi^{\top}\hat{U}^{k}\Phi$ . Then we can compute

[TABLE]

For any $\eta\in(0,1)$ , when the sample size $|\mathcal{D}_{k}|\geq\frac{6}{\eta^{2}}\log\frac{2d}{\delta_{0}}$ , Lemma D.2 gives $w.p.\ 1-\delta_{0}$ that $\|\Phi^{\top}\hat{U}^{k}\Phi-\Phi^{\top}U\Phi\|\leq\eta$ . Let $\omega=2\eta$ , so that this requirement coincides with the sample size $\frac{24}{\omega^{2}}\log\frac{2d}{\delta_{0}}$ in the lemma, and $\Delta_{k}\succeq(\omega-\eta)I=\eta I$ is invertible. Then by the Sherman-Morrison-Woodbury (SMW) formula, we have

[TABLE]

Using the singular value decomposition, we write $\Phi_{\mu}=U_{1}\Sigma_{1}V_{1}^{\top},\Delta_{k}=U_{2}\Sigma_{2}U_{2}^{\top}$ . Then,

[TABLE]

where the last inequality is because $\|\Delta_{k}\|=\|\Phi^{\top}\hat{U}^{k}\Phi+\omega I-\Phi^{\top}U\Phi\|\leq\omega+\|\Phi^{\top}\hat{U}^{k}\Phi-\Phi^{\top}U\Phi\|\leq\omega+\eta=3\eta$ . Thus by Lemma D.1, we have $w.p.\ 1-2\delta_{0}$ that

[TABLE]

and

[TABLE]

where the second inequality uses the fact that $g(\theta^{k})=\Phi_{\mu}^{\top}U^{\frac{1}{2}}\delta^{k}$ . Plugging (34), (35) into (D.2) yields that given $\theta^{k}$ ,

[TABLE]

$w.p.\ 1-2\delta_{0}$ , where $T_{\beta}Q=(1-\beta)Q+\beta TQ$ . This completes the proof. ∎
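The update analyzed in this lemma can be sketched in a few lines of NumPy. This is a minimal illustration, assuming linear features $Q(s,a;\theta)=\phi(s,a)^{\top}\theta$ and a batch gradient $\hat{g}^{k}$ built from the sampled Bellman residual with the target held fixed; the function name, argument names, and this exact form of $\hat{g}^{k}$ are assumptions of the sketch.

```python
import numpy as np


def gntd_linear_step(theta, phi, phi_next, rewards, gamma, beta, omega):
    """One damped Gauss-Newton TD step on a batch of transitions.

    phi, phi_next: (N, d) feature matrices for (s, a) and (s', a').
    The form of g_hat (sampled residual Q - TQ with a frozen target)
    is an assumption made for this sketch.
    """
    N, d = phi.shape
    td_residual = phi @ theta - (rewards + gamma * phi_next @ theta)
    g_hat = phi.T @ td_residual / N              # batch gradient estimate
    H_hat = phi.T @ phi / N                      # empirical Phi^T U_hat Phi
    direction = np.linalg.solve(H_hat + omega * np.eye(d), g_hat)
    return theta - beta * direction
```

When $d>N$ and the batch features have full row rank, letting $\omega\to 0$ makes the fitted values on the batch approach $(1-\beta)Q^{k}+\beta\,(r+\gamma Q^{k}(s^{\prime},a^{\prime}))$ , the sampled counterpart of $T_{\beta}Q^{k}$ ; the damping $\omega$ keeps the linear system well posed when $\hat{H}^{k}$ is singular.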

Lemma D.7.

Suppose Assumptions 3.1 and 3.4 hold. We define

[TABLE]

Suppose the accuracy level $\varepsilon$ is small enough s.t. $\varepsilon\leq(1-(1-\gamma)\beta)^{K}\|Q^{0}-Q^{\pi}\|_{\mu}.$ If we set $\beta\in(0,1)$ , $\omega=\frac{(1-\gamma)^{2}\lambda_{0}^{3/2}\varepsilon}{4\tau_{2}}$ , and the batch size satisfies $|\mathcal{D}_{k}|\geq\max\left\{\frac{12\tau_{2}^{2}}{(1-\gamma)^{4}\lambda_{0}\varepsilon^{2}},\frac{24}{\omega^{2}}\right\}\log\frac{2Kd}{\delta}$ for each iteration, then for any $\xi\sim\mathcal{D}$ , $\delta>0$ and $k=0,1,\cdots,K$ , we have w.p. $1-\delta$ that

[TABLE]

Proof.

Similar to Lemma D.4, we prove this lemma by induction. For $k=0$ , we have

[TABLE]

Thus the lemma holds for $k=0$ . Suppose it holds for all $k=0,1,\cdots,t$ ; we prove it for $k=t+1$ .

For $k=0,1,\cdots,t$ , we choose $\delta_{0}=\frac{\delta}{2K}$ in Lemma D.6 and $|\mathcal{D}_{k}|\geq\max\left\{\frac{12\tau_{2}^{2}}{(1-\gamma)^{4}\lambda_{0}\varepsilon^{2}},\frac{24}{\omega^{2}}\right\}\log\frac{2Kd}{\delta}$ , $\omega=\frac{(1-\gamma)^{2}\lambda_{0}^{3/2}\varepsilon}{4\tau_{2}}$ , then $w.p.\ 1-2t\delta_{0}$ we have

[TABLE]

Note that for $\forall k$ , $\|Q^{k}-TQ^{k}\|_{\mu}\leq\|Q^{k}-Q^{\pi}\|_{\mu}+\|TQ^{k}-Q^{\pi}\|_{\mu}\leq(1+\gamma)\|Q^{k}-Q^{\pi}\|_{\mu}.$ Then conditioning on the success of the high probability events in the first $t$ steps, by (D.2) in Lemma D.6, we have

[TABLE]

where (i) follows (37), and (ii) follows from

[TABLE]

Therefore, we also have

[TABLE]

where (i) follows a computation similar to (D.1). Therefore, we prove by induction that (36) holds for $k\leq K$ , w.p. $1-2K\delta_{0}=1-\delta$ . ∎
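The factor $(1-(1-\gamma)\beta)$ that appears in the smallness condition on $\varepsilon$ comes from the damped operator $T_{\beta}Q=(1-\beta)Q+\beta TQ$ . Assuming, as in the bound on $\|Q^{k}-TQ^{k}\|_{\mu}$ used in this proof, that $T$ is a $\gamma$ -contraction in $\|\cdot\|_{\mu}$ with fixed point $Q^{\pi}$ , and that $\beta\in(0,1)$ , the one-step contraction can be written out as:

```latex
\begin{aligned}
\|T_{\beta}Q-Q^{\pi}\|_{\mu}
&=\|(1-\beta)(Q-Q^{\pi})+\beta(TQ-TQ^{\pi})\|_{\mu} \\
&\leq(1-\beta)\|Q-Q^{\pi}\|_{\mu}+\beta\gamma\|Q-Q^{\pi}\|_{\mu}
 =\big(1-(1-\gamma)\beta\big)\|Q-Q^{\pi}\|_{\mu}.
\end{aligned}
```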

Now we are ready to prove Theorem 3.12. To include the discussion of the specific parameters in the theorem, we restate Theorem 3.12 as follows.

Theorem D.8.

Suppose Assumptions 3.1 and 3.4 hold and the accuracy level satisfies $\varepsilon\ll\|Q^{0}-Q^{\pi}\|_{\mu}$ . Set $\beta\in(0,1),\omega=\frac{(1-\gamma)^{2}\lambda_{0}^{3/2}\varepsilon}{4\tau_{2}}$ and the sample size $|\mathcal{D}_{k}|=N\geq\max\left\{\frac{12\tau_{2}^{2}}{(1-\gamma)^{4}\lambda_{0}\varepsilon^{2}},\frac{24}{\omega^{2}}\right\}\log\frac{2Kd}{\delta}$ for each iteration, where $d$ is the dimension of the parameter $\theta$ . Then w.p. $1-\delta$ the output of Algorithm 1 satisfies

[TABLE]

Then we can guarantee $\|Q^{K}-Q^{\pi}\|_{\mu}\leq\mathcal{O}(\varepsilon)$ with $\mathcal{O}\left(\frac{1}{(1-\gamma)}\log\frac{1}{\varepsilon}\right)$ iterations and $\mathcal{O}\left(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon}\right)$ samples in total.
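The relation between the iteration count, the damping rate, and the total sample budget can be illustrated numerically. The sketch below assumes a linear rate with contraction factor $1-(1-\gamma)\beta$ , the factor in the smallness condition of Lemma D.7, and takes the batch size at its stated lower bound; the helper name and the `initial_gap` argument (standing in for $\|Q^{0}-Q^{\pi}\|_{\mu}$ ) are hypothetical.

```python
import math


def theorem_d8_budget(gamma, lambda0, tau2, eps, beta, d, delta, initial_gap):
    """Iteration count, damping rate, batch size, and total samples for given constants."""
    rate = 1.0 - (1.0 - gamma) * beta
    # Largest K such that eps <= rate**K * initial_gap.
    K = max(1, math.floor(math.log(initial_gap / eps) / -math.log(rate)))
    omega = (1.0 - gamma) ** 2 * lambda0 ** 1.5 * eps / (4.0 * tau2)
    per_iteration = max(
        12.0 * tau2 ** 2 / ((1.0 - gamma) ** 4 * lambda0 * eps ** 2),
        24.0 / omega ** 2,
    ) * math.log(2.0 * K * d / delta)
    N = math.ceil(per_iteration)
    return K, omega, N, K * N
```

With these choices $K$ grows only logarithmically in $1/\varepsilon$ for a fixed $\beta$ , while $N$ grows like $\varepsilon^{-2}$ up to logarithmic factors, which is where the total sample count comes from.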

Proof.

First we have $w.p.\ 1-\delta$ that $\sigma_{k}\leq\frac{\tau_{2}}{1-\gamma},\forall k$ when the conditions in Lemma D.7 hold. By replacing $\sigma_{k}$ in Lemma D.6 with $\frac{\tau_{2}}{1-\gamma}$ , we have that

[TABLE]

Thus we complete the proof. ∎

D.3 Proof of Theorem 3.13 (Stochastic Neural GNTD)

Recall the stochastic GNTD formula with the neural network function approximation (9):

[TABLE]

where the feature vector $\phi(s,a)\in\mathbb{R}^{d}$ and the parameters $\theta\in\mathbb{R}^{md}$ . Note that Lemmas C.3, C.4, and C.5 do not depend on the update rule, thus they still hold in the current discussion. Next, we provide a uniform bound on the $\sigma_{k}$ defined in (23).

Lemma D.9.

Conditioning on the success of the high probability events of Lemmas C.3 and C.5, where the success probability is chosen as $1-\delta_{1}$ , for any $\xi$ and any $\theta$ satisfying $\|\theta-\theta^{0}\|_{2}\leq\frac{6\|Q^{0}-TQ^{0}\|_{\mu}}{(1-\gamma)\sqrt{\lambda_{\min}(G^{0})}}$ , we have w.p. $1-\delta_{1}$ that

[TABLE]

Proof.

First we compute the bounds on the gradient norm of the Q function as follows

[TABLE]

Consequently, decomposing the $\theta$ in $\hat{g}(\theta;\xi)$ into $\theta^{0}$ and $\theta-\theta^{0}$ yields

[TABLE]

where (i) follows from $\|\nabla Q(s,a;\theta)\|^{2}\leq 1$ and $\|x+y+z\|^{2}\leq 3(\|x\|^{2}+\|y\|^{2}+\|z\|^{2})$ , and (ii) follows from the bound on the distance between $\theta$ and $\theta^{0}$ . ∎
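The gradient bound $\|\nabla Q(s,a;\theta)\|^{2}\leq 1$ can be checked numerically under a common modeling assumption. The sketch below assumes that the network (9) has the two-layer ReLU form $Q(x;\theta)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}b_{r}\max(\theta_{r}^{\top}x,0)$ with fixed output weights $b_{r}\in\{-1,1\}$ and a normalized feature $\|x\|\leq 1$ ; this form, the sizes, and the seed are assumptions of the example, since definition (9) is not reproduced in this section.

```python
import numpy as np


def relu_net_gradient(theta, b, x):
    """Gradient w.r.t. theta of Q(x; theta) = (1/sqrt(m)) * sum_r b_r * relu(theta_r . x)."""
    m = theta.shape[0]
    active = (theta @ x > 0).astype(float)                  # which hidden units fire, shape (m,)
    return (b * active)[:, None] * x[None, :] / np.sqrt(m)  # shape (m, d)


rng = np.random.default_rng(1)
m, d = 64, 8
theta = rng.standard_normal((m, d))       # hidden-layer weights, one row per unit
b = rng.choice([-1.0, 1.0], size=m)       # fixed output signs
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # normalized feature vector, ||x|| = 1
grad_sq_norm = np.sum(relu_net_gradient(theta, b, x) ** 2)
```

Under this form the squared gradient norm equals the fraction of active hidden units times $\|x\|^{2}$ , so it never exceeds one regardless of the width $m$ .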

Lemma D.10.

Suppose Assumptions 3.1 and 3.6 hold. For any $\delta_{2}>0$ , we choose $\omega\in(0,2)$ and the sample size $|\mathcal{D}_{k}|=N\geq\frac{24}{\omega^{2}}\log\frac{2md}{\delta_{2}}$ for the $k$ -th iteration. If the iterate satisfies $\|\theta^{k}-\theta^{0}\|_{2}\leq\frac{6\|Q^{0}-TQ^{0}\|_{\mu}}{(1-\gamma)\sqrt{\lambda_{\min}(G^{0})}}$ , we have $w.p.\ 1-(\delta_{1}+2\delta_{2})$ that

[TABLE]

and

[TABLE]

where $\tilde{\tau}_{3}^{2}=(1-\gamma)^{2}\left(9(1+\gamma^{2})\max_{(s,a)}|Q(s,a;\theta^{0})|^{2}+9R_{\max}^{2}\right)+\frac{108(1+\gamma^{2})\|Q^{0}-TQ^{0}\|_{\mu}^{2}}{\lambda_{\min}(G^{0})}$ .

Proof.

According to Lemma D.9, for any $\theta^{k}$ that satisfies $\|\theta^{k}-\theta^{0}\|_{2}\leq\frac{6\|Q^{0}-TQ^{0}\|_{\mu}}{(1-\gamma)\sqrt{\lambda_{\min}(G^{0})}}$ , we have $\|\hat{g}(\theta^{k},\xi)\|\leq\frac{\tilde{\tau}_{3}}{1-\gamma}$ for any $\xi\sim\mathcal{D}$ . Recalling $\theta(s)=s\theta^{k+1}+(1-s)\theta^{k}$ in Lemma C.6, we consider

[TABLE]

The first term of the above equation is almost identical to the estimate of equation (D.2) in Lemma D.6. Recall that Lemma C.6 in Section C.3 provides a technique for analyzing the residual term. We follow this derivation and get

[TABLE]

$w.p.\ 1-\delta_{1}$ , where (i) follows Lemma C.5 or the same derivation as Lemma C.6, and we can reduce $②$ to the case of estimating (D.2). Finally, by Lemma D.6 and D.9, we have $w.p.\ 1-(\delta_{1}+2\delta_{2})$ that

[TABLE]

where $\delta_{1}$ represents the probability that the $\mu$ -weighted Gram matrix $G^{k}$ is not positive definite, and $\delta_{2}$ represents the probability that the concentration inequality fails. Plugging (39) and (40) into (38), we complete the proof. ∎

To simplify the notation, let us denote

[TABLE]

Now we provide the proof of Theorem 3.13, which is restated as follows, with more detailed discussion of the parameters.

Theorem D.11.

Suppose Assumptions 3.1 and 3.6 hold and the accuracy level $\varepsilon\ll\|Q^{0}-TQ^{0}\|_{\mu}$ s.t. $\varepsilon\leq\frac{1}{2}\left(1-\frac{(1-\gamma)\beta}{2}\right)^{K}\|Q^{0}-TQ^{0}\|_{\mu}$ . We set $\beta\in(0,1)$ , $\omega=\frac{(1-\gamma)^{2}\varepsilon}{16\tau_{3}}$ , $m=\Omega\left(\frac{n^{3}}{\nu^{2}\lambda_{0}^{4}\delta^{2}(1-\gamma)^{2}}\right)$ and the batch size $|\mathcal{D}_{k}|=N\geq\max\left\{\frac{192\tau_{3}^{2}}{(1-\gamma)^{4}\varepsilon^{2}},\frac{24}{\omega^{2}}\right\}\log\frac{3Kmd}{\delta}$ s.t. $C\leq\frac{1-\gamma}{2(1+\gamma)}$ , where $d$ is the dimension of the feature map $\phi(\cdot,\cdot)$ , and $C=\mathcal{O}(m^{-1/6})$ is a small constant. Then w.p. $1-\delta$ , the output of Algorithm 1 satisfies

[TABLE]

Then we can guarantee $\|Q^{K}-Q^{\pi}\|_{\mu}\leq\mathcal{O}(\varepsilon)$ with $\mathcal{O}\left(\frac{1}{(1-\gamma)}\log\frac{1}{\varepsilon}\right)$ iterations and $\mathcal{O}\left(\frac{1}{\varepsilon^{2}}\log\frac{1}{\varepsilon}\right)$ samples in total.

Proof.

Similar to Section C.3, the key results of Lemmas D.9 and D.10 all depend on the condition that $\theta^{k}$ stays close to $\theta^{0}$ . Thus we will need to prove the following result by induction

[TABLE]

Obviously (41) holds for $k=0$ .

We assume (41) holds for $k=0,1,\cdots,t$ and prove it for $k=t+1$ . Note that Lemmas D.9 and D.10 hold for $k=0,1,\cdots,t$ due to (41). Next we set $\delta_{1}=\frac{\delta}{3}$ and $\delta_{2}=\frac{\delta}{3K}$ . With $T_{\beta}Q:=(1-\beta)Q+\beta TQ$ , we have $w.p.\ 1-\delta_{1}-2t\delta_{2}$ that the following inequality holds for $k=0,1,\cdots,t$

[TABLE]

where (i) is due to the same derivation in Section C.3, (ii) is due to Lemma D.10, and (iii) is due to $C=\mathcal{O}(m^{-1/6})\leq\frac{(1-\gamma)}{2(1+\gamma)}$ as long as the network width $m$ is sufficiently large. Consequently, we have for $k\leq t$ that

[TABLE]

where the last inequality is due to the selection of $\omega$ and $N$ in the theorem. Note that the accuracy level $\varepsilon$ is small enough so that $\varepsilon\leq\frac{1}{2}\left(1-\frac{(1-\gamma)\beta}{2}\right)^{K}\|Q^{0}-TQ^{0}\|_{\mu}$ . Thus for $k=t+1$ , we have

[TABLE]

where (i) is due to Lemma D.10, and (ii) is due to $\frac{8\tau_{3}}{(1-\gamma)^{2}}\left(\omega+\sqrt{\frac{3\log(3K/\delta)}{N}}\right)\leq\varepsilon\leq\frac{1}{2}\left(1-\frac{(1-\gamma)\beta}{2}\right)^{s}\|Q^{0}-TQ^{0}\|_{\mu}$ for any $s$ . Therefore, the statement (41) holds.

The above verifies that when the conditions in Theorem D.11 hold, $\theta^{k}$ will always stay close to the initialization parameters $\theta^{0}$ with high probability. Thus the lemmas in Section D.3 all apply. Finally, by Lemma D.10, for each $k=0,1,\cdots,K-1$ , we have $w.p.\ 1-(\delta_{1}+2K\delta_{2})=1-\delta$ that

[TABLE]

This completes the proof. ∎
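The failure probabilities add up by a union bound over the single event of probability $\delta_{1}$ and the two concentration events of probability $\delta_{2}$ in each of the $K$ iterations. With the choices $\delta_{1}=\frac{\delta}{3}$ and $\delta_{2}=\frac{\delta}{3K}$ made in the proof, the arithmetic is:

```latex
\delta_{1}+2K\delta_{2}
=\frac{\delta}{3}+2K\cdot\frac{\delta}{3K}
=\frac{\delta}{3}+\frac{2\delta}{3}
=\delta .
```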

D.4 Proof of Theorem 3.14 (Stochastic Nonlinear Smooth GNTD)

We restate Theorem 3.14 as follows to include the specifics of the parameters.

Theorem D.12.

Suppose Assumptions 3.1, 3.8 and 3.9 hold. We set $\beta\in(0,1)$ and the damping rate $\omega\in(0,1)$ for each iteration, where $\theta\in\mathbb{R}^{d}$ . Using the Proximal Boost (ProxBoost) method to obtain $\hat{d}^{k}$ , the output $Q^{K}$ of Algorithm 1 satisfies w.p. $1-\delta$ that

[TABLE]

for given constants $M_{1},M_{2}>0$ . We choose the step size $\beta=\mathcal{O}((1-\gamma)\varepsilon)$ , the damping rate $\omega=\mathcal{O}((1-\gamma)^{2}\varepsilon^{2})$ and the sample size $N=\Omega\left(\frac{1}{(1-\gamma)^{2}\varepsilon^{2}}\log\frac{K}{\delta}\right)$ for any $\varepsilon>0$ . Then we can guarantee $\|Q^{K}-Q^{\pi}\|_{\mu}\leq\mathcal{O}(\varepsilon+\varepsilon_{\mathcal{F}})$ with $\mathcal{O}\left(\frac{1}{(1-\gamma)\varepsilon}\log\frac{1}{\varepsilon}\right)$ iterations and $\mathcal{O}\left(\frac{1}{\varepsilon^{3}}(\log\frac{1}{\varepsilon})^{2}\right)$ samples in total.
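The total sample count follows by multiplying the number of iterations by the per-iteration sample size. As a rough check, taking $N$ at its stated lower bound and treating $\gamma$ and $\delta$ as constants in the last step:

```latex
\begin{aligned}
K\cdot N
&=\mathcal{O}\!\left(\frac{1}{(1-\gamma)\varepsilon}\log\frac{1}{\varepsilon}\right)
 \cdot\mathcal{O}\!\left(\frac{1}{(1-\gamma)^{2}\varepsilon^{2}}\log\frac{K}{\delta}\right) \\
&=\mathcal{O}\!\left(\frac{1}{(1-\gamma)^{3}\varepsilon^{3}}\log\frac{1}{\varepsilon}\,\log\frac{K}{\delta}\right)
 =\mathcal{O}\!\left(\frac{1}{\varepsilon^{3}}\Big(\log\frac{1}{\varepsilon}\Big)^{2}\right),
\end{aligned}
```

where the last equality uses $\log\frac{K}{\delta}=\mathcal{O}\big(\log\frac{1}{\varepsilon}\big)$ , which holds because $K$ is polynomial in $\frac{1}{\varepsilon}$ .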

Proof.

To begin with, we rewrite the problem (6) as $L_{N}^{\omega}(\theta^{k},\overline{d}):=\frac{1}{N}\sum_{\xi\in\mathcal{D}_{k}}\ell(\xi,\theta^{k},\overline{d})+\frac{\omega}{2}\|\overline{d}\|^{2},$ and define its population version as $L^{\omega}(\theta^{k},\overline{d}):=\mathbb{E}_{\xi\sim\mathcal{D}}\left[\ell(\xi,\theta^{k},\overline{d})\right]+\frac{\omega}{2}\|\overline{d}\|^{2}.$ Recalling the definition of $M_{1}$ in Section 3.1.3, we have $\|d^{k}\|^{2}\leq\frac{M_{1}}{L_{2}}$ . Let $d^{k}_{\omega}=\arg\min_{\overline{d}}L^{\omega}(\theta^{k},\overline{d})$ . For any $\omega>0$ , we have $\|d^{k}_{\omega}\|^{2}\leq\frac{M_{1}}{L_{2}}$ .

Now we consider the constrained subproblem $\min_{\left\{\overline{d}:\|\overline{d}\|^{2}\leq\frac{4M_{1}}{L_{2}}\right\}}L^{\omega}(\theta^{k},\overline{d})$ . To solve this subproblem, we use the ProxBoost procedure, whose output is $\hat{d}^{k}$ for each $k$ . Set $\omega\in(0,1)$ and we have $d^{k}_{\omega}=\arg\min_{\left\{\overline{d}:\|\overline{d}\|^{2}\leq\frac{4M_{1}}{L_{2}}\right\}}L^{\omega}(\theta^{k},\overline{d})$ . For any $\overline{d}$ satisfying $\|\overline{d}\|^{2}\leq\frac{4M_{1}}{L_{2}}$ , $L^{\omega}(\theta^{k},\overline{d})$ is $L_{0}$ -Lipschitz and $2\lambda_{0}$ -strongly convex w.r.t. $\overline{d}$ , where $L_{0}=\frac{(2L_{1}^{2}+1)\sqrt{M_{1}}}{\sqrt{L_{2}}}+(1+\gamma)L_{1}^{2}+L_{1}R_{\max}$ . By Lemma D.2, when the number of samples per iteration in the ProxBoost method is at least $\frac{24L_{1}^{4}}{\lambda_{0}^{2}}\log(12d)$ , the empirical loss function is $L_{0}$ -Lipschitz and $\lambda_{0}$ -strongly convex with probability at least $\frac{5}{6}$ . At this point, the subproblem $L^{\omega}(\theta^{k},\overline{d})$ satisfies the convergence conditions of the ProxBoost method. By Corollary 7 in (Davis et al., 2021), for any $\delta_{0},\varepsilon>0$ , we have $w.p.\ 1-\delta_{0}$ that

[TABLE]

when the total number of samples used by the ProxBoost method in the $k$ -th iteration is

[TABLE]

Thus,

[TABLE]

Therefore, there exists a constant $M_{2}$ such that

[TABLE]

Let $\delta_{0}=\frac{\delta}{K}$ for each iteration of GNTD. Then similar to Theorem 3.10, we have $w.p.\ 1-K\delta_{0}=1-\delta$ that

[TABLE]

where (i) is due to the key derivation in Section C.4, and (ii) is due to the gap between $L(\theta^{k},\hat{d}^{k})$ and $L(\theta^{k},d^{k})$ . Summing the above inequality over $k=0,1,\cdots,K-1$ , we have $w.p.\ 1-\delta$ that

[TABLE]

Thus we complete the proof. ∎

Appendix E Additional Experiments on Section 4.2.2

In this section, we provide additional implementation details and numerical results for GNTD3+BC. Following TD3+BC (Fujimoto & Gu, 2021), a behavioral cloning regularization term, controlled by $\lambda$ , is added to constrain the expected total return

[TABLE]

For the critic part, we apply the GNTD method to minimize the following MSBE of Clipped Double Q-learning (Fujimoto et al., 2018), that is,

[TABLE]

where

[TABLE]

See Algorithm 4 for more details.
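For readers unfamiliar with the two ingredients named above, the following minimal NumPy sketch shows the clipped double-Q target and the TD3+BC actor loss in their commonly published forms. It is an assumption that these match the displayed formulas of this appendix, which are not reproduced here; the function names and the default `alpha=2.5` are illustrative, and target-policy smoothing noise is omitted for brevity.

```python
import numpy as np


def clipped_double_q_target(reward, done, q1_next, q2_next, gamma):
    """Batch target y = r + gamma * (1 - done) * min(Q1', Q2').

    q1_next and q2_next are the two target critics evaluated at (s', a').
    """
    return reward + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)


def td3_bc_actor_loss(q_pi, pi_actions, data_actions, alpha=2.5):
    """Actor loss -lambda * mean(Q(s, pi(s))) + mean((pi(s) - a)^2).

    lambda = alpha / mean|Q| normalizes the scale of the Q term.
    """
    lam = alpha / np.mean(np.abs(q_pi))
    bc_error = np.mean((pi_actions - data_actions) ** 2)
    return -lam * np.mean(q_pi) + bc_error
```

In GNTD3+BC only the critic update changes: the mean-squared error between the critics and this target is minimized with the GNTD step rather than a plain gradient step, while the actor loss keeps the TD3+BC form.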

Figure 3 shows the training curves for the different algorithms. The GNTD3+BC algorithm (red) outperforms the TD3+BC algorithm (green) in terms of both final results and convergence speed.

Bibliography

Achiam, J., Knight, E., and Abbeel, P. Towards characterizing divergence in deep Q-learning. arXiv preprint arXiv:1903.08894, 2019.
Agarwal, R., Schuurmans, D., and Norouzi, M. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pp. 104–114. PMLR, 2020.
Agazzi, A. and Lu, J. Temporal-difference learning with nonlinear function approximation: lazy training and mean field regimes. In Mathematical and Scientific Machine Learning, pp. 37–74. PMLR, 2022.
Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pp. 1691–1692. PMLR, 2018.
Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233–246, 2002.
Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
