Sample Complexity of Estimating the Policy Gradient for Nearly Deterministic Dynamical Systems

Osbert Bastani

arXiv:1901.08562·cs.LG·October 12, 2021

TL;DR

This paper develops a theoretical framework showing that for nearly deterministic systems, finite-difference policy gradient estimates can have lower variance than traditional methods, with empirical validation on an inverted pendulum.

Contribution

It introduces a new theoretical understanding of policy gradient estimation in nearly deterministic systems, highlighting the advantages of finite-difference methods.

Findings

1. Finite-difference estimates have lower variance in nearly deterministic systems.

2. Theoretical analysis explains the effectiveness of finite-difference methods.

3. Empirical results on the inverted pendulum support the theory.

Abstract

Reinforcement learning is a promising approach to learning robotics controllers. It has recently been shown that algorithms based on finite-difference estimates of the policy gradient are competitive with algorithms based on the policy gradient theorem. We propose a theoretical framework for understanding this phenomenon. Our key insight is that many dynamical systems (especially those of interest in robotics control tasks) are nearly deterministic -- i.e., they can be modeled as a deterministic system with a small stochastic perturbation. We show that for such systems, finite-difference estimates of the policy gradient can have substantially lower variance than estimates based on the policy gradient theorem. Finally, we empirically evaluate our insights in an experiment on the inverted pendulum.

Equations

$s_{t+1}=f(s_{t},a_{t})+\zeta_{t}$ where $\zeta_{t}\sim p(\zeta)$

$J(\theta)=\mathbb{E}_{p_{\theta}(\alpha)}\left[\sum_{t=0}^{T-1}R(s_{t},a_{t})\right]$

$D(\theta)=\nabla_{\theta}J(\theta)$

$Q_{\theta}^{(t)}(s,a)$

$V_{\theta}^{(t)}(s)$

$\hat{J}(\theta;\vec{\zeta})$

$\hat{D}_{\text{MB}}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\nabla_{\theta}\hat{J}(\theta;\vec{\zeta}^{(i)})$

$\tilde{Q}_{\theta}^{(t)}(s,a)$

$\tilde{V}_{\theta}^{(t)}(s)$

$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tilde{p}_{\theta}(\alpha)}\left[\sum_{t=0}^{T-1}\tilde{Q}_{\theta}^{(t)}(s_{t},a_{t})\nabla_{\theta}\log\tilde{\pi}_{\theta}(a_{t}\mid s_{t})\right]$

$\hat{Q}_{\theta}^{(t)}(\alpha)$

$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tilde{p}_{\theta}(\alpha)}\left[\sum_{t=0}^{T-1}\tilde{A}_{\theta}^{(t)}(\alpha)\nabla_{\theta}\log\tilde{\pi}_{\theta}(a_{t}\mid s_{t})\right]$

$\hat{D}_{\text{PG}}(\theta)$

$\hat{A}_{\theta}^{(t)}(\alpha)$

$\nabla_{x}f(x)$

$\nabla_{\theta}J(\theta)\approx\sum_{k=1}^{d_{\Theta}}\frac{J(\theta+\lambda\nu^{(k)})-J(\theta-\lambda\nu^{(k)})}{2\lambda}\cdot\nu^{(k)}$

$\hat{D}_{\text{FD}}(\theta)=\sum_{k=1}^{d_{\Theta}}\bigg[\frac{\frac{1}{n}\sum_{i=1}^{n}\hat{J}(\theta+\lambda\nu^{(k)};\vec{\zeta}^{(k,i)})}{2\lambda}-\frac{\frac{1}{n}\sum_{j=1}^{n}\hat{J}(\theta-\lambda\nu^{(k)};\vec{\eta}^{(k,j)})}{2\lambda}\bigg]\cdot\nu^{(k)}$

$\Pr_{x^{(1)},...,x^{(n)}\sim p_{X}(x)}\left[\|\hat{\mu}_{X}^{(n)}\|\geq\epsilon\right]\leq\delta$

$n_{\text{MB}}(\epsilon,\delta)$

$n_{\text{PG}}(\epsilon,\delta)$

$n_{\text{FD}}(\epsilon,\delta)$

$\Pr\left[\hat{\mu}_{X}^{(n)}\geq\epsilon\right]\leq\delta$

$\|\hat{D}_{\text{MB}}(\theta)-\nabla_{\theta}J(\theta)\|\leq AE+B$

$R((\vartheta,\omega),a)=-(w_{\vartheta}\cdot(\vartheta-\vartheta_{0})^{2}+w_{\omega}\cdot\omega^{2}+w_{a}\cdot a^{2})$

$V_{\theta}^{(T)}(s)$

$\nabla_{\theta}V_{\theta}^{(t)}(s)$
Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning

Full text

**Sample Complexity of Estimating the Policy Gradient for Nearly Deterministic Dynamical Systems**

Osbert Bastani

University of Pennsylvania, USA

1 Introduction

The policy gradient is the workhorse of modern reinforcement learning. In particular, most state-of-the-art reinforcement learning algorithms aim to learn a control policy $\pi_{\theta}$ by estimating the policy gradient—i.e., the gradient $\nabla_{\theta}J(\theta)$ of the expected cumulative reward $J(\theta)$ with respect to the parameters $\theta$ of the control policy—in one of two ways: (i) numerically, e.g., using a finite-difference approximation (Kober et al., 2013; Mania et al., 2018), or (ii) by using the policy gradient theorem (Sutton et al., 2000) to construct estimates (Silver et al., 2014; Schulman et al., 2015a, b, 2017). However, there has been little work on theoretically understanding the tradeoffs between these two approaches, and our work aims to help fill this gap.

We are interested in applications to robotics control, which typically have continuous state and action spaces (Collins et al., 2005; Abbeel et al., 2007; Levine et al., 2016). For example, reinforcement learning can be used to learn controllers when the dynamics are unknown (Abbeel et al., 2007; Ross and Bagnell, 2012; Akametalu et al., 2014; Berkenkamp et al., 2017; Johannink et al., 2018). Understanding sample complexity is especially important in this application, since the goal is for robots to be able to learn based on real world experience, which can be very costly to obtain. Furthermore, having a theoretical understanding of sample complexity is important for developing safe reinforcement learning algorithms (Akametalu et al., 2014; Berkenkamp et al., 2017; Dean et al., 2018b).

We argue that near determinism is an important characteristic of dynamical systems relevant to robotics. More precisely, we study settings where the noise in the dynamics is “small” (i.e., sub-Gaussian with small constant). This setting captures robotics tasks such as grasping (Andrychowicz et al., 2018), quadcopters (Akametalu et al., 2014), walking (Collins et al., 2005), and driving (Montemerlo et al., 2008), where the dynamics are primarily deterministic but include small perturbations such as wind, friction, or slippage. We discuss this claim in detail below.

Main results. In the context of near determinism, we analyze the sample complexity of various algorithms for estimating the policy gradient $\nabla_{\theta}J(\theta)$ . We study three algorithms: (i) an algorithm based on finite-differences, (ii) an algorithm based on the policy gradient theorem, and (iii) a model-based algorithm (i.e., it knows the system dynamics) that uses backpropagation to estimate the policy gradient. The model-based algorithm represents the best convergence rate we can hope to achieve using only random samples of the noise. We give details on these algorithms in Section 3.

Our key parameter of interest is the sub-Gaussian parameter $\sigma_{\zeta}$ of the system noise $\zeta$ , which is small for nearly deterministic systems. Here, we also consider dependences on the estimation error $\epsilon$ and the dimension $d_{\Theta}$ of the parameter space; we state theorems giving dependences on all parameters in Section 4. We prove the following bounds on the sample complexity $n$ (i.e., the number of samples needed to get at most $\epsilon$ error with probability at least $1-\delta$ ):

• For the model-based estimate, $n=\tilde{\Theta}(\sigma_{\zeta}^{2}/\epsilon^{2})$.

• For the finite-differences estimate, $n=\tilde{\Theta}(\sigma_{\zeta}^{2}d_{\Theta}/\epsilon^{4})$.

• For the estimate based on the policy gradient theorem, $n=\tilde{O}(1/\epsilon^{2})$ and $n=\tilde{\Omega}(1/\epsilon)$.

Our key finding is that while the sample complexities of both the model-based and finite-difference estimates become small as $\sigma_{\zeta}$ becomes small, the sample complexity of the estimate based on the policy gradient theorem does not. Thus, for nearly deterministic dynamical systems, finite-difference algorithms perform significantly better. However, this improvement comes at a price: $n$ depends on $d_{\Theta}$, and furthermore quadratically more samples are needed to get to the same estimation error.
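To make the comparison concrete, the following sketch evaluates the three rates with every hidden constant set to 1 and with all logarithmic factors, Lipschitz constants, and the dependence on $T$ dropped. These simplifications and the function names are assumptions made for illustration only. The floors of 1 and $2d_{\Theta}$ samples reflect that at least one rollout (respectively two rollouts per parameter) is always needed.

```python
import math

def n_model_based(sigma, eps):
    # Rate ~ sigma^2 / eps^2, but at least one rollout is always needed.
    return max(1, math.ceil(sigma ** 2 / eps ** 2))

def n_finite_difference(sigma, eps, d_theta):
    # Rate ~ sigma^2 * d_theta / eps^4, but at least two rollouts per parameter.
    return max(2 * d_theta, math.ceil(sigma ** 2 * d_theta / eps ** 4))

def n_policy_gradient_upper(eps):
    # Upper-bound rate ~ 1 / eps^2; it does not shrink with sigma.
    return math.ceil(1.0 / eps ** 2)

eps, d_theta = 0.5, 5
for sigma in (1.0, 0.1, 0.0):
    print(sigma,
          n_model_based(sigma, eps),
          n_finite_difference(sigma, eps, d_theta),
          n_policy_gradient_upper(eps))
```

As `sigma` shrinks, the first two counts fall to their floors while the third stays fixed, which is the qualitative behavior the bounds describe.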

Finally, we focus on how many samples are needed to estimate the policy gradient on a single step. This understanding is already useful for applications such as safe reinforcement learning. Nevertheless, we discuss how our results connect to the problem of optimizing $J(\theta)$ in Section 4.

Motivation for near determinism. A common approach in robotics is to model the robot dynamics as deterministic (Levinson et al., 2011; Kuindersma et al., 2016). To account for stochasticity, either a stabilizing controller such as a PID controller is used (Levinson et al., 2011), or the robot’s trajectory is replanned at every step (Kwon et al., 1983; Kuindersma et al., 2016). An alternative approach is to assume that the dynamics are deterministic plus a bounded perturbation at each step, and then use robust control (Akametalu et al., 2014). Both approaches implicitly assume that the deterministic portion of the dynamics is a good approximation of the full dynamics. In general, most systems that have been successfully studied in reinforcement learning are nearly deterministic, including Atari games (Mnih et al., 2015), MuJoCo benchmarks (Todorov et al., 2012; Levine and Koltun, 2013), and simulated grasping tasks (Andrychowicz et al., 2018).

More importantly, we believe that it will be challenging to increase the sample efficiency of reinforcement learning in systems where the noise is high. Indeed, our analysis shows that noise can be greatly amplified by the dynamics, so if the noise is large, we believe there is very little hope for sample-efficient reinforcement learning. In these settings, we may need to rely on techniques such as transfer learning (Taylor and Stone, 2009), meta-learning (Finn et al., 2017), or learning to plan (Tamar et al., 2016) to achieve low sample complexity.

Related work. The theoretical work in reinforcement learning algorithms has primarily focused on $Q$-learning (Watkins and Dayan, 1992; Kearns and Singh, 2002; Kakade et al., 2003; Jin et al., 2018), especially for Markov decision processes (MDPs) with finite state and action spaces. There has been some work on understanding the sample complexity of reinforcement learning with function approximation, e.g., for fitted value iteration (Munos and Szepesvári, 2008), for fitted policy iteration (Antos et al., 2008; Lazaric et al., 2012; Farahmand et al., 2015, 2016), for fitted $Q$-iteration (Tosatto et al., 2017), and for the $\text{TD}(0)$ algorithm (Dalal et al., 2018). For robotics tasks, where state and action spaces are typically continuous, the most successful approaches are predominantly based on policy gradient estimation (Collins et al., 2005; Kober et al., 2013), for which there has been relatively little work. In this direction, Kakade et al. (2003) analyzed the sample complexity of algorithms based on the policy gradient theorem, but did not study the dependence of the sample complexity on the magnitude of the system noise. Furthermore, their work assumes finite state and action spaces and bounded rewards, and does not consider finite-difference algorithms.

There has been work characterizing a key design choice of finite-difference algorithms—i.e., the distribution of perturbations used to numerically estimate the policy gradient (Roberts and Tedrake, 2009). They measure the performance of different choices using the signal-to-noise ratio. In contrast, our goal is to understand the sample complexity of different algorithms for nearly deterministic systems.

There has recently been work on understanding the sample complexity of learning controllers; however, they focus on linear dynamical systems, and on different algorithms—e.g., temporal difference learning (Tu and Recht, 2018b) or model-based algorithms (Dean et al., 2018a; Tu and Recht, 2018a). There has also been work in this setting studying the possibility of reducing variance by controlling the noise in the dynamics (Malik et al., 2019); in the setting we study, we cannot control the noise.

There has been recent work comparing approaches based on exploration in the action space (based on the policy gradient theorem) to exploration in the state space (based on finite difference methods) (Vemula et al., 2019). Our focus on nearly deterministic systems enables us to obtain qualitatively different insights compared to theirs. In particular, they find that approaches based on finite differences perform better for problems with a long time horizon. However, we analyze a more realistic model, and find that this insight no longer holds. Instead, approaches based on finite differences outperform approaches based on the policy gradient theorem for nearly deterministic systems.

Our analysis differs from theirs in three key ways. First, they assume an upper bound $J(\theta)\leq J_{\text{max}}$, which is a very strong assumption. Second, their analysis does not model stochastic dynamics. Instead, they assume that $J(\theta)$ is deterministic, but they can only obtain observations $J(\theta)+\zeta$, where $\zeta$ is i.i.d. noise. In contrast, our analysis considers both stochastic dynamics, as well as how noise is propagated through the dynamics. This distinction substantially complicates our analysis, but is necessary for us to understand the implications of near determinism (since we need to understand how the dynamics can amplify noise). Finally, unlike their work, we provide lower bounds for our main results.

Connection to optimizing $J(\theta)$. Estimating the policy gradient can be used in conjunction with stochastic gradient descent to optimize $J(\theta)$. There is a large body of work on understanding the convergence rate of stochastic gradient descent (Robbins and Monro, 1985; Spall et al., 1992; Bottou and Bousquet, 2008; Moulines and Bach, 2011), of which policy gradient algorithms are a special case. Indeed, Vemula et al. (2019) use these techniques to bound the complexity of optimizing $J(\theta)$.

There are several reasons why we focus on understanding the sample complexity of a single gradient step rather than the sample complexity of optimization. First, these optimization analyses rely on the strong assumption that $J(\theta)$ is bounded, i.e., $J(\theta)\leq J_{\text{max}}$ for some $J_{\text{max}}\in\mathbb{R}_{+}$. Second, it would be much more difficult to derive lower bounds on optimization: existing lower bounds are for the setting where the objective $f$ comes from a very general function family, and these bounds may not apply when $f$ is restricted to be the objective of a reinforcement learning problem. In contrast, for sample complexity, we derive matching (or almost matching) upper and lower bounds. Third, the sample complexity of estimating $\nabla_{\theta}J(\theta)$ is of intrinsic interest; for example, it is an important prerequisite for safe reinforcement learning algorithms (Akametalu et al., 2014; Berkenkamp et al., 2017; Dean et al., 2018b). Finally, focusing on sample complexity simplifies our key insight. In particular, consider the completely deterministic setting: optimizing a deterministic function using gradient descent may still take many steps, but “estimating” the gradient only requires a single sample.

Additionally, we note that sample complexity is directly related to the complexity of optimizing $J(\theta)$ . In particular, the bounds in Vemula et al. (2019) all depend directly on the variance $\sigma^{2}$ of the observations $J(\theta)+\zeta$ . Our proof bounds the sample complexity of estimating $\nabla J(\theta)$ by bounding the sub-Gaussian parameter of $J(\theta)$ , which is an upper bound on the variance of $J(\theta)$ . Thus, smaller sample complexity translates to smaller complexity of optimizing $J(\theta)$ .

Finally, our focus on estimating the gradient does not address the problem of exploration. In terms of optimization, gradient estimates can be used in conjunction with gradient descent to efficiently find local minima, whereas exploration is needed to find global minima. Understanding the sample complexity of exploration is an important but orthogonal problem that we leave to future work.

2 Preliminaries

We consider a dynamical system with states $S\subseteq\mathbb{R}^{d_{S}}$ , actions $A\subseteq\mathbb{R}^{d_{A}}$ , and transitions

$$s_{t+1}=f(s_{t},a_{t})+\zeta_{t}\quad\text{where}\quad\zeta_{t}\sim p(\zeta),$$

where $f:S\times A\to S$ is deterministic and $\zeta\in\mathbb{R}^{d_{S}}$ is a random perturbation. We consider deterministic control policies $\pi_{\theta}:S\to A$ with parameters $\theta\in\Theta\subseteq\mathbb{R}^{d_{\Theta}}$. Except in the case of the model-based policy gradient algorithm, we assume that both $f$ and $p$ are unknown. We separate $f$ from $p$ since we are interested in settings where $\zeta$ is small. Also, we assume that $\zeta_{t}$ is independent of $s_{t}$ and $a_{t}$. This assumption enables us to substantially simplify the model-based policy gradient (since we avoid taking derivatives of $p$), and it also simplifies our analyses of other algorithms.
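The model above can be sketched as a short rollout simulator. Everything concrete here is a hypothetical choice for illustration rather than a system from the paper: the scalar dynamics `f`, the linear policy, the quadratic reward, and Gaussian noise (which is sub-Gaussian) with standard deviation `sigma`.

```python
import random

def rollout(f, policy, reward, s0, T, sigma, rng):
    """Simulate s_{t+1} = f(s_t, a_t) + zeta_t and return the cumulative reward."""
    s, total = s0, 0.0
    for _ in range(T):
        a = policy(s)
        total += reward(s, a)
        # zeta_t is drawn independently of s_t and a_t.
        s = f(s, a) + rng.gauss(0.0, sigma)
    return total

# Hypothetical scalar example (assumption, not from the paper).
def f(s, a):
    return 0.9 * s + a

def reward(s, a):
    return -(s * s + 0.1 * a * a)

def make_policy(theta):
    return lambda s: -theta * s

rng = random.Random(0)
det = rollout(f, make_policy(0.5), reward, 1.0, 10, 0.0, rng)
noisy = [rollout(f, make_policy(0.5), reward, 1.0, 10, 0.01, rng) for _ in range(5)]
print(det, noisy)
```

With `sigma = 0.0` every rollout returns the same value, and with a small `sigma` the returns stay close to it, which is the sense in which such a system is nearly deterministic.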

A common approach is to use an estimate $\hat{Q}_{\theta}^{(t)}(s,a)$ of the $Q$ function in place of $\hat{Q}_{\theta}^{(t)}(\alpha)$. This approach reduces variance, but may introduce bias. For instance, for dynamical systems with continuous actions, the deterministic policy gradient (DPG) algorithm uses this approach (Silver et al., 2014). We consider the algorithm described above for two reasons. First, our focus is on estimating the policy gradient, rather than understanding the sample complexity of $Q$-learning, which is required to analyze DPG. Second, it is hard to prove bounds for DPG since it relies on the derivative of the $Q$ function, which cannot be bounded without additional assumptions. For example, suppose we train a random forest $\hat{Q}_{\theta}^{(t)}(s,a)$. Even if this model achieves good accuracy, its gradient would be zero nearly everywhere since this model is piecewise constant; thus, it would not be useful in the context of the DPG algorithm.

Finite-difference policy gradient. We can use finite-differences to estimate $\nabla_{\theta}J(\theta)$ .

Theorem 3.3.

For any $f:\mathcal{X}\to\mathbb{R}$ (where $\mathcal{X}\subseteq\mathbb{R}^{d}$) where $\nabla f$ is $L_{\nabla f}$-Lipschitz continuous (we assume the $L_{2}$ norm throughout),

$$\nabla_{x}f(x)=\sum_{k=1}^{d}\frac{f(x+\lambda\nu^{(k)})-f(x-\lambda\nu^{(k)})}{2\lambda}\cdot\nu^{(k)}+\Delta,$$

where $\nu^{(k)}=\delta_{k}$ (where $\delta_{k}$ is the Kronecker delta), and $\Delta\in\mathbb{R}^{d}$ satisfies $\|\Delta\|\leq L_{\nabla f}d\lambda$.

We give a proof in Appendix E. Then, the finite difference approximation of the policy gradient is

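$$\nabla_{\theta}J(\theta)\approx\sum_{k=1}^{d_{\Theta}}\frac{J(\theta+\lambda\nu^{(k)})-J(\theta-\lambda\nu^{(k)})}{2\lambda}\cdot\nu^{(k)}.$$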

We can estimate $J(\theta)$ using samples $\vec{\zeta}\sim p(\vec{\zeta})$ , which yields the estimator $\nabla_{\theta}J(\theta)\approx\hat{D}_{\text{FD}}(\theta)$ , where

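$$\hat{D}_{\text{FD}}(\theta)=\sum_{k=1}^{d_{\Theta}}\bigg[\frac{\frac{1}{n}\sum_{i=1}^{n}\hat{J}(\theta+\lambda\nu^{(k)};\vec{\zeta}^{(k,i)})}{2\lambda}-\frac{\frac{1}{n}\sum_{j=1}^{n}\hat{J}(\theta-\lambda\nu^{(k)};\vec{\eta}^{(k,j)})}{2\lambda}\bigg]\cdot\nu^{(k)},$$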

where $\vec{\zeta}^{(k,i)},\vec{\eta}^{(k,j)}\sim p(\vec{\zeta})$ i.i.d. for $k\in[m]$ and $i,j\in[n]$ . Note that we use separate samples $\zeta^{(k,i)}$ and $\eta^{(k,j)}$ to estimate $J(\theta+\lambda\nu^{(k)})$ and $J(\theta-\lambda\nu^{(k)})$ , respectively. If we are using a simulator, then we can reduce variance by using the same samples to estimate both terms.
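A minimal sketch of this estimator follows, assuming a hypothetical one-dimensional system and affine policy chosen only for illustration (`make_j_hat` and its constants are not from the paper). As in the text, independent rollouts are used for the two perturbed parameter vectors.

```python
import random

def fd_policy_gradient(j_hat, theta, lam, n, rng):
    """Coordinate-wise central-difference estimate of the gradient of J at theta.

    j_hat(theta, rng) must return the cumulative reward of one sampled rollout.
    """
    grad = []
    for k in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[k] += lam
        minus[k] -= lam
        # Separate, independent rollouts for theta + lam*nu and theta - lam*nu.
        j_plus = sum(j_hat(plus, rng) for _ in range(n)) / n
        j_minus = sum(j_hat(minus, rng) for _ in range(n)) / n
        grad.append((j_plus - j_minus) / (2.0 * lam))
    return grad

def make_j_hat(sigma, T=10):
    """Hypothetical system s' = 0.9*s + a + noise with policy a = -theta[0]*s - theta[1]."""
    def j_hat(theta, rng):
        s, total = 1.0, 0.0
        for _ in range(T):
            a = -theta[0] * s - theta[1]
            total += -(s * s + 0.1 * a * a)
            s = 0.9 * s + a + rng.gauss(0.0, sigma)
        return total
    return j_hat

rng = random.Random(0)
theta = [0.5, 0.0]
# Deterministic system: one rollout per perturbed policy (2 * d rollouts in total).
exact = fd_policy_gradient(make_j_hat(0.0), theta, 1e-3, 1, rng)
# Nearly deterministic system: a larger lam keeps the noise from being divided by a tiny step.
noisy = fd_policy_gradient(make_j_hat(0.01), theta, 0.05, 200, rng)
print(exact, noisy)
```

The choice of `lam` trades bias against variance: a smaller step reduces the finite-difference error but divides the rollout noise by a smaller number, so more rollouts are needed.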

Remark 3.4.

Typically, rather than choose a fixed set of basis vectors $\nu^{(1)},...,\nu^{(d_{\Theta})}$, finite-difference algorithms choose random vectors from a spherically symmetric distribution, e.g., $\nu\sim\mathcal{N}(0,\sigma^{2}I_{d_{\Theta}})$ (Spall et al., 1992; Mania et al., 2018). Our choice of a fixed basis simplifies our analysis.
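For comparison, the random-direction variant mentioned in this remark might look as follows. This is only a sketch under stated assumptions: directions are drawn from $\mathcal{N}(0,I)$ (i.e., $\sigma=1$), `J` is treated as a noise-free function of the parameters, and the function name is hypothetical. This variant is not the estimator analyzed in the theorems.

```python
import random

def random_direction_gradient(J, theta, lam, m, rng):
    """Average of directional central differences along m Gaussian directions."""
    d = len(theta)
    grad = [0.0] * d
    for _ in range(m):
        nu = [rng.gauss(0.0, 1.0) for _ in range(d)]
        plus = [t + lam * v for t, v in zip(theta, nu)]
        minus = [t - lam * v for t, v in zip(theta, nu)]
        slope = (J(plus) - J(minus)) / (2.0 * lam)
        for k in range(d):
            # E[nu nu^T] = I, so the average of slope * nu approximates the gradient.
            grad[k] += slope * nu[k] / m
    return grad

# Linear test function with known gradient (1, 2).
linear = lambda th: 1.0 * th[0] + 2.0 * th[1]
estimate = random_direction_gradient(linear, [0.0, 0.0], 0.1, 20000, random.Random(0))
print(estimate)
```

Unlike the fixed-basis estimator, this one is random even when `J` itself is deterministic, because the directions are sampled.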

4 Main Results

Sample complexity. Recall that the policy gradient $\nabla_{\theta}J(\theta)$ must be estimated from sampled rollouts $\zeta\sim p_{\theta}(\zeta)$. Our goal is to understand the tradeoffs in the sample complexity of estimating $\nabla_{\theta}J(\theta)$ between different reinforcement learning algorithms.

Definition 4.1.

Let $X$ be a random vector, and let $\hat{\mu}_{X}^{(n)}=n^{-1}\sum_{i=1}^{n}x^{(i)}$, where $x^{(1)},...,x^{(n)}\sim p_{X}(x)$ i.i.d. The sample complexity $n_{X}(\epsilon,\delta)$ of $X$ is the smallest $n\in\mathbb{N}$ such that

$$\Pr_{x^{(1)},...,x^{(n)}\sim p_{X}(x)}\left[\|\hat{\mu}_{X}^{(n)}\|\geq\epsilon\right]\leq\delta.$$

We are interested in the sample complexity $n_{\hat{D}}$ of $\hat{D}(\zeta)-\nabla_{\theta}J(\theta)$ , where $\hat{D}(\zeta)$ is an estimate of $\nabla_{\theta}J(\theta)$ using a single rollout $\zeta\sim p_{\theta}(\zeta)$ .
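Definition 4.1 can be probed numerically. The sketch below estimates the failure probability by Monte Carlo for a scalar random variable and scans for the smallest $n$ that meets the target. The function names, the scalar restriction, and the zero-mean Gaussian test distributions are assumptions for illustration. Because the probabilities are themselves estimated, the returned $n$ is only approximate.

```python
import random

def failure_probability(sample, n, eps, trials, rng):
    """Monte Carlo estimate of Pr[|mean of n samples| >= eps]."""
    failures = 0
    for _ in range(trials):
        mean = sum(sample(rng) for _ in range(n)) / n
        if abs(mean) >= eps:
            failures += 1
    return failures / trials

def empirical_sample_complexity(sample, eps, delta, trials, rng, n_max=1000):
    """Smallest n <= n_max whose estimated failure probability is at most delta."""
    for n in range(1, n_max + 1):
        if failure_probability(sample, n, eps, trials, rng) <= delta:
            return n
    return None

rng = random.Random(0)
# Zero-mean errors with large and small spread (illustrative choices).
n_noisy = empirical_sample_complexity(lambda r: r.gauss(0.0, 1.0), 0.5, 0.1, 2000, rng)
n_small = empirical_sample_complexity(lambda r: r.gauss(0.0, 0.1), 0.5, 0.1, 2000, rng)
print(n_noisy, n_small)
```

Shrinking the spread of the error distribution drives the required $n$ down to a single sample, mirroring the role of $\sigma_{\zeta}$ in the bounds.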

Assumptions. We let $f_{\theta}(s)=f(s,\pi_{\theta}(s))$ and $R_{\theta}(s)=R(s,\pi_{\theta}(s))$ . Similarly, for a stochastic policy $\pi_{\theta}(s)+\xi$ (where $\xi\sim p(\xi)$ ), we let $\tilde{f}_{\theta}(s,\xi)=f(s,\pi_{\theta}(s)+\xi)$ and $\tilde{R}_{\theta}(s)=\mathbb{E}_{p(\xi)}[R(s,\pi_{\theta}(s)+\xi)]$ . Next, to ensure convergence, we make regularity assumptions about the dynamics and our control policy; see Appendix F & G for definitions.

Assumption 4.2.

We assume that $f$ , $R$ , $\pi_{\theta}$ , $f_{\theta}$ , $\tilde{f}_{\theta}$ , $R_{\theta}$ and $\tilde{R}_{\theta}$ are Lipschitz continuous and are twice continuously differentiable with Lipschitz continuous first derivative.

Remark 4.3.

This standard assumption is needed to ensure that we can estimate the gradient using finite differences. It is somewhat strong—e.g., it rules out commonly used quadratic rewards. In practice, the state space is often compact, in which case the Lipschitz continuity assumption becomes redundant. However, we cannot handle discontinuous rewards or dynamics (including piecewise constant rewards). In these cases, the policy gradient may diverge near the discontinuities; thus, the sample complexity of estimating this gradient may diverge as well. In principle, we could handle discontinuities as long as the policy visits these discontinuities with zero probability.

Finally, for any function $h$ , we let $L_{h}$ denote its Lipschitz constant and $\bar{L}_{h}=\max\{L_{\nabla h},L_{h},1\}$ .

Assumption 4.4.

We assume that $p(\zeta)$ is $\sigma_{\zeta}$-sub-Gaussian.

This assumption is required for proving concentration; e.g., it is typically assumed in the context of safe reinforcement learning (Akametalu et al., 2014; Berkenkamp et al., 2017). In practice, perturbations due to noise are often bounded (which implies the noise is sub-Gaussian), especially for our setting of interest; e.g., forces due to wind, friction, or slippage have bounded magnitude. We are interested in settings where $\sigma_{\zeta}$ is small.

Definition 4.5.

A system is nearly deterministic if $\sigma_{\zeta}\ll 1$ .

In particular, we are interested in the dependence of the sample complexity on $\sigma_{\zeta}$ .

Main theorems. For the model-based policy gradient, we have:

Theorem 4.6.

For $\delta\leq 1/2$ , the sample complexity of $\hat{D}_{\text{MB}}(\theta)-\nabla_{\theta}J(\theta)$ satisfies

[TABLE]

For the policy gradient based on Theorem 3.1:

Theorem 4.7.

For the choice $p_{\xi}(\xi)=\mathcal{N}(\xi\mid\vec{0},\sigma_{\zeta}^{2}I_{d_{A}})$ , $\hat{D}_{\text{PG}}(\theta)-\nabla_{\theta}J(\theta)$ has sample complexity

[TABLE]

where $d=\max\{d_{S},d_{A}\}$ , for $\epsilon$ sufficiently small—i.e., $\epsilon=\Omega(T^{6}(L_{R}+L_{\tilde{R}_{\theta}})\bar{L}_{f}L_{\pi}\bar{L}_{\tilde{f}_{\theta}}^{T}d^{4})$ . Next,

[TABLE]

The first lower bound holds for any $p_{\xi}(\xi)$ that is everywhere differentiable on $\mathbb{R}$ and satisfies $\lim_{\xi\to\pm\infty}\xi\cdot p_{\xi}(\xi)=0$ , where $n_{\xi}$ is the sample complexity of estimating $\mathbb{E}_{p_{\xi}(\xi)}[\xi\cdot\nabla_{\xi}\log p_{\xi}(\xi)]$ using samples from $p_{\xi}$ . The second lower bound holds for $p_{\xi}(\xi)=\mathcal{N}(0,\sigma_{\xi}^{2})$ , for any $\sigma_{\xi}\in\mathbb{R}_{+}$ .

We have shown two lower bounds: one for an arbitrary distribution $p_{\xi}$ (in terms of a sample complexity $n_{\xi}$ related to $p_{\xi}$), and one for the specific choice where $p_{\xi}$ is Gaussian (as is the case in our upper bound). Also, note that our upper bound depends on choosing the action noise to have variance $\sigma_{\zeta}^{2}$. In principle, the first lower bound holds even if $p_{\xi}$ depends on the problem parameters; however, then $n_{\xi}$ may depend on these parameters as well. The second lower bound is independent of the action noise $\sigma_{\xi}$, so it holds even if $\sigma_{\xi}$ depends on the problem parameters.

Remark 4.8.

Note that the upper and lower bounds have a gap on the order of $\epsilon^{1/2}$ . We believe that this gap is due to limitations in our analysis. In particular, our lower bounds depend on a lower bound on the tail of the $\chi_{n}^{2}$ distribution, which has exponential tails. In contrast, our other lower bounds depend on Gaussian tails, which are doubly exponential. Intuitively, since the $\chi_{n}^{2}$ distribution has a longer tail, it should not have lower sample complexity.

Remark 4.9.

Note that the second lower bound contains a dependence on $\delta^{-1/2}$ , which is unusual. However, this term only has a role if the first term in the minimum is very large. Furthermore, the first term depends as usual on $\log(1/\delta)$ (which is not shown since we omit log factors).

Remark 4.10.

Actor-critic approaches reduce variance by using function approximation to obtain lower variance estimates of the advantage $\tilde{A}_{\theta}^{(t)}$ (Schulman et al., 2015b). However, our lower bounds hold even if the advantage is known exactly. Thus, while actor-critic approaches can reduce variance, they do not affect our main insight that these estimates remain noisy for nearly deterministic dynamical systems.

For the finite-difference policy gradient:

Theorem 4.11.

The sample complexity of $\hat{D}_{\text{FD}}(\theta)-\nabla_{\theta}J(\theta)$ satisfies

[TABLE]

The first bound (i.e., the upper bound) holds for a choice $\lambda=O(\epsilon/T^{5}\bar{L}_{R_{\theta}}\bar{L}_{f_{\theta}}^{4T}d_{A})$. The second bound (i.e., the lower bound) holds for any $\lambda\in\mathbb{R}_{+}$, $\epsilon\leq 1$, and $\delta\leq 1/2$.

Note that our upper bound is for the choice $\lambda=O(\epsilon)$ , but our lower bound holds for arbitrary $\lambda$ .

Remark 4.12.

In an abuse of notation, in Theorem 4.11, we have ignored the fact that $n_{\text{FD}}$ must always be at least $2d_{\Theta}$ ; in particular, it does not go to zero as $\sigma_{\zeta}$ goes to zero. This discrepancy in Theorem 4.11 arises because there is an implicit assumption we use when inverting Hoeffding’s inequality that $n\geq 1$ —more precisely, Hoeffding’s inequality gives a bound of the form

$$\Pr\left[\hat{\mu}_{X}^{(n)}\geq\epsilon\right]\leq\delta,$$

where $\hat{\mu}_{X}^{(n)}$ is an estimate of $\mu_{X}=\mathbb{E}[X]$ using $n$ samples, and $\delta\geq e^{-n\epsilon^{2}/(2\sigma^{2})}$ . Solving for $n$ yields $n\geq 2\sigma^{2}\log(1/\delta)/\epsilon^{2}$ . However, if $\sigma=0$ , then $\delta$ is not well defined, so it does not mean we can get an estimate of $\mu_{X}$ using $n=0$ samples; instead, we need to take $n=1$ . In our proof of Theorem 4.11, we apply Hoeffding’s inequality $2d_{\Theta}$ times (since we estimate the gradient of each component separately), so we need $n\geq 2d_{\Theta}$ .
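The inversion described in this remark can be written as a small helper. The function names are hypothetical. The formula is the one stated above, $n\geq 2\sigma^{2}\log(1/\delta)/\epsilon^{2}$, with the floor of one sample made explicit.

```python
import math

def hoeffding_sample_size(sigma, eps, delta):
    """Smallest n with exp(-n * eps^2 / (2 * sigma^2)) <= delta, floored at one sample."""
    if sigma == 0.0:
        # The formula would give n = 0, but one sample is still required.
        return 1
    return max(1, math.ceil(2.0 * sigma ** 2 * math.log(1.0 / delta) / eps ** 2))

def finite_difference_floor(d_theta):
    # Two rollouts (theta + lam*nu and theta - lam*nu) per coordinate.
    return 2 * d_theta

print(hoeffding_sample_size(1.0, 0.1, 0.05))
print(hoeffding_sample_size(0.0, 0.1, 0.05))
print(finite_difference_floor(5))
```

For example, with $\sigma=1$, $\epsilon=0.1$, and $\delta=0.05$ the formula gives $2\log(20)/0.01\approx 599.1$, so 600 samples.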

Proof strategy. We give a high-level overview of our proof strategy, focusing on Theorem 4.6. Our proof proceeds in two steps. First, we prove an upper bound

$$\|\hat{D}_{\text{MB}}(\theta)-\nabla_{\theta}J(\theta)\|\leq AE+B,\qquad(1)$$

where $E=T^{-1}\sum_{t=0}^{T-1}\|\zeta_{t}\|$ and $A,B\in\mathbb{R}_{+}$ do not depend on $\vec{\zeta}$ . This step uses induction based on the recursive structure of $V_{\theta}$ . Second, we prove Lemma G.7; we state a simplified version:

Lemma 4.13.

Let $X$ be a $\sigma_{X}$ -sub-Gaussian random vector over $\mathbb{R}^{d}$ , and let $Y$ be a random vector over $\mathbb{R}^{d^{\prime}}$ satisfying $\|Y\|\leq A\|X\|_{1}+B$ , where $A,B\in\mathbb{R}_{+}$ . Then $Y$ is $\sigma_{Y}$ -sub-Gaussian, where $\sigma_{Y}=\tilde{O}(A\sigma_{X}d+B)$ .

Combined with (1), we conclude that $\hat{D}_{\text{MB}}(\theta)-\nabla_{\theta}J(\theta)$ is sub-Gaussian, from which we can use Hoeffding’s inequality (see Lemma G.3) to complete the proof. For the lower bound, we construct a system where $J(\theta)$ is Gaussian. The proof of Theorem 4.7 follows similarly, except we need to use analogous results for sub-exponential random variables. In particular, we prove Lemma H.7, an analog of Lemma G.7. The proof of Theorem 4.11 also follows similarly, but we need to account for the bias in the finite-difference estimate of $\nabla_{\theta}J(\theta)$ from Theorem 3.3.

5 Discussion

Dependence on $\sigma_{\zeta}$. Both $n_{\text{MB}}$ and $n_{\text{FD}}$ scale quadratically in $\sigma_{\zeta}$ (i.e., as $\sigma_{\zeta}^{2}$). Thus, the corresponding algorithms perform very well when $\sigma_{\zeta}$ is small. In contrast, $n_{\text{PG}}$ does not become small when $\sigma_{\zeta}$ becomes small. Intuitively, if $p_{\xi}$ is wide, then the action noise adds uncertainty to $\hat{D}_{\text{PG}}(\theta)$. On the other hand, if $p_{\xi}$ is narrow, then $\nabla_{\theta}\log\tilde{\pi}_{\theta}(a\mid s)=\nabla_{\theta}\log p_{\xi}(a-\pi_{\theta}(s))$ becomes large; in particular, $p_{\xi}$ must change rapidly for some values of $\xi$, and must have large gradient at such values of $\xi$.
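This intuition can be checked on a toy problem. The setup is an assumption chosen for illustration and is not taken from the paper: a one-step problem with reward $R(a)=-a^{2}$, constant policy $\pi_{\theta}=\theta$, Gaussian action noise $\xi\sim\mathcal{N}(0,\sigma_{\xi}^{2})$, and no noise in the dynamics. Even with the exact advantage, the single-sample score-function estimate has variance $8\theta^{2}+10\sigma_{\xi}^{2}$ in this example, so it does not vanish as $\sigma_{\xi}\to 0$.

```python
import random

def pg_single_sample_estimates(theta, sigma_xi, n, rng):
    """Score-function estimates of d/dtheta E[R(theta + xi)] for R(a) = -a^2."""
    value = -(theta ** 2) - sigma_xi ** 2      # exact E[R(theta + xi)]
    estimates = []
    for _ in range(n):
        xi = rng.gauss(0.0, sigma_xi)
        a = theta + xi
        advantage = -(a ** 2) - value          # exact advantage of action a
        score = xi / sigma_xi ** 2             # d/dtheta log N(a | theta, sigma_xi^2)
        estimates.append(advantage * score)
    return estimates

def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

rng = random.Random(0)
results = {}
for sigma_xi in (1.0, 0.1, 0.01):
    est = pg_single_sample_estimates(1.0, sigma_xi, 20000, rng)
    results[sigma_xi] = mean_and_variance(est)
    print(sigma_xi, results[sigma_xi])
```

The sample mean stays near the true gradient $-2\theta$, while the sample variance levels off near $8\theta^{2}$ instead of shrinking with the action noise. A finite-difference estimate on the same deterministic problem would have zero variance.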

A key point is that in the first lower bound for $n_{\text{PG}}$ (i.e., for arbitrary $p_{\xi}$), even though we do not know its explicit dependence on $\epsilon$, $\delta$, $T$, and $\bar{L}_{f_{\theta}}$, we know that it is completely independent of $\sigma_{\zeta}$. Thus, regardless of how $p_{\xi}$ is chosen (e.g., even if it is chosen based on the problem parameters), the sample complexity does not become small as $\sigma_{\zeta}$ becomes small.

Full determinism ( $\sigma_{\zeta}=0$ ). When $\sigma_{\zeta}=0$ , we have $n_{\text{MB}}=1$ (i.e., we only need a single sample to estimate $\nabla_{\theta}J(\theta)$ ) and $n_{\text{FD}}=2d_{\Theta}$ (i.e., we need two samples to estimate the derivative of each parameter, taking $\lambda$ small enough to get $\epsilon$ error). For the case of $n_{\text{PG}}$ , our lower bound in Theorem 4.7 still holds—the dynamical system we use to obtain the lower bound has no noise in the dynamics. In particular, a large number of samples are still needed to obtain good estimates (i.e., possibly exponential in $T$ ).

Dependence on $\epsilon$ . Both $n_{\text{MB}}$ and $n_{\text{PG}}$ depend quadratically on $\epsilon$ (ignoring the gap between the upper and lower bounds for $n_{\text{PG}}$ ). In contrast, $n_{\text{FD}}$ depends quartically on $\epsilon$ . This gap arises because according to Theorem 3.3, the finite-differences error of $\hat{D}_{\text{FD}}(\theta)$ (assuming there is no noise) depends linearly on $\lambda$ . Thus, we must choose $\lambda=O(\epsilon)$ to obtain error at most $\epsilon$ . If the dynamical system and control policy are both linear, then this error goes away, so the dependence on $\epsilon$ becomes quadratic.
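The linear dependence of the finite-difference error on $\lambda$ can be seen on a one-dimensional example. The sketch below compares the central difference on `phi(x) = x * |x|`, whose derivative is Lipschitz but which is not twice differentiable at 0, with a quadratic, for which the central difference is exact. Both functions are illustrative choices; the quadratic is a stand-in for a well-behaved objective rather than the paper's linear setting.

```python
def central_difference(f, x, lam):
    """Symmetric finite-difference estimate of f'(x) with step lam."""
    return (f(x + lam) - f(x - lam)) / (2.0 * lam)

def phi(x):
    """x * |x|: phi'(x) = 2|x| is 2-Lipschitz, but phi is not twice differentiable at 0."""
    return x * abs(x)

def quadratic(x):
    """Quadratic test function; its exact derivative at 0 is -1."""
    return 3.0 * x**2 - x + 2.0

for lam in (1e-1, 1e-2, 1e-3):
    err_phi = abs(central_difference(phi, 0.0, lam) - 0.0)
    err_quad = abs(central_difference(quadratic, 0.0, lam) - (-1.0))
    print(f"lam={lam:g}  error(phi)={err_phi:.3e}  error(quadratic)={err_quad:.3e}")
```

For `phi`, the error at 0 equals $\lambda$ , so reaching error $\epsilon$ requires $\lambda\leq\epsilon$ ; for the quadratic, the error is zero up to floating-point rounding.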

Dependence on $d_{\Theta}$ . Only $n_{\text{FD}}$ depends on $d_{\Theta}$ —whereas the other two algorithms make use of the fact that we can compute $\nabla_{\theta}\pi_{\theta}$ , the finite-difference approximation ignores this ability.

Dependence on $T$ . All of the sample complexities depend exponentially on $T$ . As we show in our lower bounds, this dependence is unavoidable: it arises from the fact that the dynamics cause the state (and therefore the rewards) to grow exponentially large in $T$ . A common assumption made in prior work is that the rewards are bounded uniformly by $R_{\text{max}}\in\mathbb{R}_{+}$ (Kearns and Singh, 2002; Kakade et al., 2003). Intuitively, our results indicate that without stronger assumptions, $R_{\text{max}}$ may be exponentially large. In practice, rewards for continuous control tasks are often quadratic, and can indeed be exponentially large in magnitude.

An important aspect is that estimation is substantially easier when the current policy is good. In our bounds, the base of the exponential dependence is always $\bar{L}_{f_{\theta}}$ . If the initial policy $\pi_{\theta}$ provides relatively stable control, then we may expect that $L_{f_{\theta}}\leq 1$ —i.e., the states remain bounded in magnitude. Then, we have $\bar{L}_{f_{\theta}}=1$ , so our bounds no longer depend exponentially on $T$ . This insight suggests the importance of good initialization for fast estimation.
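A small numerical sketch of this effect, assuming a scalar noise-free closed-loop system $s_{t+1}=Ls_{t}$ and assuming $\bar{L}=\max\{L,1\}$ (consistent with the statement above, though the definition appears elsewhere in the paper). The function names are hypothetical.

```python
def state_magnitude(L, T, s0=1.0):
    """|s_T| for the scalar closed-loop system s_{t+1} = L * s_t (no noise)."""
    s = s0
    for _ in range(T):
        s = L * s
    return abs(s)

def growth_factor(L, T):
    """max(1, L)^T: the exponential factor when the base is max{L, 1} (assumed)."""
    return max(1.0, L) ** T

for L in (0.9, 1.0, 1.1):
    print(f"L={L}  |s_T|={state_magnitude(L, 50):.4g}  factor={growth_factor(L, 50):.4g}")
```

With $T=50$ , a closed-loop constant of 1.1 already inflates the state by more than two orders of magnitude, whereas any $L\leq 1$ leaves the factor at 1.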

Indeed, policy gradient estimators can have high variance in practice. As an example, consider the cart-pole problem with continuous action space, with random initial state and where the reward function is the negative distance to origin. We empirically estimated that the MSE of the model-based policy gradient estimator using $n=1$ on a randomly initialized policy for this benchmark is $3.5\times 10^{7}$ . This error is substantially reduced when the policy is stable—for a trained cart-pole policy, we estimate that the MSE of the model-based policy gradient estimator is just $5.2\times 10^{-2}$ .

6 Experiments

We empirically evaluated the effect of $\sigma_{\zeta}$ on the performance of the different algorithms.

Dynamical system. We use the inverted pendulum (Tedrake, 2018) (specifically, using the dynamics from OpenAI Gym (Brockman et al., 2016)), which has state space $S=\mathbb{R}^{2}$ (i.e., angle $\vartheta$ and angular velocity $\omega$ ) and actions $A=\mathbb{R}$ (i.e., applied torque). Letting $f$ be the (deterministic) pendulum dynamics, we consider the system $s_{t+1}=f(s_{t},a_{t})+\zeta_{t}$ , where $\zeta_{t}\sim\mathcal{N}(0,\sigma_{\zeta}^{2})$ i.i.d. We use the rewards

[TABLE]

where $\vartheta_{0}$ is the angle corresponding to the upright position, and $w_{\vartheta}=1$ , $w_{\omega}=10^{-1}$ , and $w_{a}=10^{-2}$ . Our goal is to control the system over a horizon of $T=50$ steps, from a fixed start state $s_{0}=(\vartheta_{0}^{\prime},0)$ , where $\vartheta_{0}^{\prime}=0.05$ . For the control policy, we use a neural network $\pi_{\theta}$ with a single hidden layer with 100 neurons, ReLU activations, and linear outputs. As usual, we randomly initialize the weights; to reduce variance, we then initialize the policy to have a reasonably high reward by running our model-based algorithm until $J(\theta)\geq-100$ .

Algorithms. We use stochastic gradient descent in conjunction with each of the three estimation algorithms. On each gradient step, we use a single sample to estimate the gradient, and we take 1000 gradient steps. We modify the finite-difference algorithm to use a single random sample $\nu\sim\text{Uniform}(S^{d_{\Theta}-1})$ (i.e., the uniform distribution on the unit sphere in $\mathbb{R}^{d_{\Theta}}$ ), rather than summing over the $d_{\Theta}$ basis vectors $\nu^{(k)}$ . This choice may improve the dependence of the sample complexity on $d_{\Theta}$ ; however, it should not affect dependence on $\sigma_{\zeta}$ , which is our parameter of interest.
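The random-direction variant can be sketched as follows. This is a minimal illustration under stated assumptions: `J_hat(theta, rng)` is a hypothetical function returning one sampled cumulative reward, `noisy_linear_objective` is a toy stand-in for a rollout, and the scaling by $d_{\Theta}$ is an assumption of this sketch (it makes the estimate match the gradient in expectation for a linear objective, and could equally be absorbed into the learning rate).

```python
import numpy as np

def sample_unit_sphere(d, rng):
    """Draw nu uniformly from the unit sphere S^{d-1} by normalizing a Gaussian vector."""
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

def random_direction_fd(J_hat, theta, lam, rng):
    """Single-direction finite-difference estimate using two noisy rollouts.

    The factor d is an assumption of this sketch: since E[nu nu^T] = I / d,
    it makes the estimate unbiased for the gradient of a linear objective.
    """
    d = theta.size
    nu = sample_unit_sphere(d, rng)
    slope = (J_hat(theta + lam * nu, rng) - J_hat(theta - lam * nu, rng)) / (2.0 * lam)
    return d * slope * nu

def noisy_linear_objective(theta, rng, c=np.array([1.0, -2.0, 0.5]), sigma_zeta=1e-3):
    """Toy stand-in for a rollout: linear objective plus Gaussian noise (illustrative)."""
    return float(c @ theta + sigma_zeta * rng.standard_normal())

rng = np.random.default_rng(0)
theta = np.zeros(3)
estimates = np.array([random_direction_fd(noisy_linear_objective, theta, 0.1, rng)
                      for _ in range(20_000)])
print(estimates.mean(axis=0))   # close to c on average
```

Each call uses two rollouts regardless of $d_{\Theta}$ , at the cost of extra variance from the random direction.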

For the algorithm based on the policy gradient theorem, we use action noise $\xi\sim\mathcal{N}(0,\sigma_{\xi}I_{d_{A}})$ . For each choice of $\sigma_{\zeta}$ , we used cross-validation to identify the optimal hyperparameters: the learning rate $\upsilon$ (for all algorithms), the parameter $\lambda$ (for the finite-differences algorithm), and the action noise $\sigma_{\xi}$ (for the algorithm based on the policy gradient theorem).

Results. We average the results of each algorithm over 20 runs; the algorithms have very high variance, so we discard runs that do not converge. In Figure 1, we show the learning curves for $\sigma_{\zeta}\in\{10^{-6},10^{-5},10^{-4},10^{-3},10^{-2},10^{-1}\}$ (i.e., $J(\theta)$ as a function of the number of gradient steps). The darker colors correspond to smaller noise. We show enlarged versions of these plots in Appendix I.

Note that unlike the other two algorithms, the finite-difference algorithm actually uses 2000 sampled rollouts (since it uses two per gradient step). However, this detail does not affect our insights regarding the relative convergence rate of different algorithms for different $\sigma_{\zeta}$ .

Our key finding is that the learning curves for the model-based and finite-differences algorithms are ordered based on the choice of $\sigma_{\zeta}$ , i.e., the curves tend to converge more quickly for smaller choices of $\sigma_{\zeta}$ . This effect is most apparent in the curves for the finite-differences algorithm, where curves for smaller $\sigma_{\zeta}$ (black and blue) converge much faster than those for larger $\sigma_{\zeta}$ (red and orange). In contrast, the learning curves for the policy gradient based algorithm do not have strong dependence on $\sigma_{\zeta}$ . For example, the fastest curve to converge (at least initially) for the policy gradient based algorithm is for our second-largest choice $\sigma_{\zeta}=10^{-2}$ (orange), whereas the slowest to converge is for $\sigma_{\zeta}=10^{-4}$ (blue). These results mirror our theoretical insights.

Finally, as expected, the model-based algorithm converges most quickly, followed by the finite-differences and policy gradient theorem based algorithms.

7 Conclusion

We have analyzed the sample complexity of algorithms for estimating the policy gradient for nearly deterministic dynamical systems. Future work includes leveraging these results in safe reinforcement learning algorithms, and understanding the sample complexity of optimizing $J(\theta)$ .

Acknowledgements

This work was supported by NSF CCF-1910769.

Appendix A Proof of Theorem 4.6

Preliminaries.

Note that the expected cumulative reward is equivalent to

[TABLE]

and the expected model-based policy gradient is

[TABLE]

Similarly, given a sample $\vec{\zeta}\sim p(\vec{\zeta})$ , the stochastic approximation of the expected cumulative reward is

[TABLE]

and the stochastic approximation of the model-based policy gradient is

[TABLE]

Bounding the deviation of $\nabla_{\theta}\hat{V}_{\theta}^{(t)}$ from $\nabla_{\theta}V_{\theta}^{(t)}$ .

We claim that for $t\in\{0,1,...,T\}$ , we have

[TABLE]

for all $\theta\in\Theta$ and $s\in S$ , where

[TABLE]

where $L_{\nabla V}^{(t)}$ is a Lipschitz constant for $\nabla V_{\theta}^{(t)}$ . The base case $t=T$ follows trivially. Note that $\sigma_{\zeta}\sqrt{d_{S}}\geq\sqrt{\mathbb{E}_{p(\zeta)}[\|\zeta\|^{2}]}\geq\mathbb{E}_{p(\zeta)}[\|\zeta\|]$ . Then, for $t\in\{0,1,...,T-1\}$ , we have

[TABLE]

Similarly, we have

[TABLE]

The claim follows.

Bounding the deviation of $\nabla_{\theta}\hat{J}$ from $\nabla_{\theta}J$ .

We claim that

[TABLE]

where $E=T^{-1}\sum_{t=0}^{T-1}\|\zeta_{t}\|$ . To this end, letting $L_{\nabla V}=\max_{t\in\{0,1,...,T\}}L_{\nabla V}^{(t)}$ , note that

[TABLE]

for $t\in\{1,2,...,T\}$ , so

[TABLE]

where the last step follows from our bound on $L_{\nabla V}^{(t)}$ in Lemma D.2.

Upper bound on sample complexity of $\nabla_{\theta}\hat{J}-\nabla_{\theta}J$ .

Note that $E\leq\|\vec{\zeta}\|_{1}$ , where we think of $\vec{\zeta}$ as the length $Td_{S}$ concatenation of the vectors $\zeta_{0},\zeta_{1},...,\zeta_{T-1}$ , so $\vec{\zeta}$ is $\sigma_{\zeta}$ -sub-Gaussian. We apply Lemma G.7 with

[TABLE]

Thus, $Y$ is $\sigma_{\text{MB}}$ -sub-Gaussian, where

[TABLE]

Thus, by Lemma G.6, the sample complexity of $\nabla_{\theta}\hat{J}(\theta)-\nabla_{\theta}J(\theta)$ is

[TABLE]

The claim follows.

Lower bound on sample complexity of $\nabla_{\theta}\hat{J}-\nabla_{\theta}J$ .

Consider a linear dynamical system with $S=A=\mathbb{R}$ , time-invariant deterministic transitions $f(s,a)=\beta s+a$ (where $\beta\in\mathbb{R}$ ), time-varying noise

[TABLE]

where $\sigma_{\zeta}\in\mathbb{R}$ , initial state $s_{0}=0$ , time-varying rewards

[TABLE]

control policy class $\pi_{\theta}(s)=\theta s$ , and current parameters $\theta=0$ . Note that

[TABLE]

where $\zeta=\zeta_{0}$ is the noise on the first step. Thus, we have

[TABLE]

so

[TABLE]

Also, note that

[TABLE]

Next, note that for $n$ i.i.d. samples $\zeta^{(1)},...,\zeta^{(n)}\sim\mathcal{N}(0,\sigma_{\zeta}^{2})$ , we have

[TABLE]

where

[TABLE]

[TABLE]

where $\xi=a-\pi_{\theta}(s)$ . Recall that $p_{\xi}(\xi)=\mathcal{N}(\vec{0},\sigma_{\zeta}^{2}I_{d_{A}})$ . Thus, we have

[TABLE]

Thus, we have

[TABLE]

as claimed.

Bounding the deviation of $\hat{D}_{\text{PG}}$ from $\nabla_{\theta}J$ .

We claim that

[TABLE]

where $L_{\tilde{V}}=\max_{t\in\{1,...,T\}}L_{\tilde{V}}^{(t)}$ , $E=T^{-1}\sum_{t=0}^{T-1}\|\zeta_{t}\|$ , and $\tilde{E}=T^{-1}\sum_{t=0}^{T-1}\|\xi_{t}\|$ . First, note that

[TABLE]

where the last step follows from the bound on $L_{\tilde{V}}^{(t)}$ in Lemma D.3. Then, we have

[TABLE]

Furthermore, we have

[TABLE]

where we have used the fact that $\mathbb{E}_{p(\vec{\zeta})}[E]=T^{-1}\sum_{t=0}^{T-1}\mathbb{E}_{p(\zeta_{t})}[\|\zeta_{t}\|]\leq\sigma_{\zeta}\sqrt{d}$ , and similarly $\mathbb{E}_{p_{\xi}(\xi)}[\tilde{E}]=T^{-1}\sum_{t=0}^{T-1}\mathbb{E}_{p_{\xi}(\xi)}[\|\xi_{t}\|]\leq\sigma_{\zeta}\sqrt{d}$ . Therefore, we have

[TABLE]

as claimed.

Upper bound on the sample complexity of $\hat{D}_{\text{PG}}-\nabla_{\theta}J$ .

We have $E^{\prime}=(\tilde{E}+E+2\sigma_{\zeta}\sqrt{d})\tilde{E}\leq\|\phi\|_{1}$ , where we think of $\phi$ as the $T^{2}(d_{A}+d_{S}+1)d_{A}$ values $\xi_{t,i}\xi_{t^{\prime},i^{\prime}}$ , $\zeta_{t,j}\xi_{t^{\prime},i^{\prime}}$ , and $2\sigma_{\zeta}\sqrt{d}\xi_{t^{\prime},i^{\prime}}$ , for all $t,t^{\prime}\in\{0,1,...,T-1\}$ , $i,i^{\prime}\in[d_{A}]$ , and $j\in[d_{S}]$ . Since $\xi_{t}$ and $\zeta_{t}$ are $\sigma_{\zeta}$ -sub-Gaussian for each $t\in T$ , by Lemma H.6, $\phi$ is $(\tau,b)$ -sub-exponential, where $\tau,b=O(d\sigma_{\zeta}^{2})$ . Thus, we can apply Lemma H.7 with

[TABLE]

Thus, $Y$ is $(\tau_{\text{PG}},b_{\text{PG}})$ -sub-exponential, where

[TABLE]

Thus, by Lemma H.4, the sample complexity of $\hat{D}_{\text{PG}}(\theta)-\nabla_{\theta}J(\theta)$ is

[TABLE]

for all $\epsilon\leq d\tau_{\text{PG}}^{2}/b_{\text{PG}}$ . The claim follows.

Lower bound on the sample complexity of $\hat{D}_{\text{PG}}-\nabla_{\theta}J$ .

Consider a linear dynamical system with $S=A=\mathbb{R}$ , time-varying deterministic transitions

[TABLE]

zero noise $p_{t}(\zeta)=\delta(0)$ (i.e., $\sigma_{\zeta}=0$ ), initial state $s_{0}=0$ , time-varying rewards

[TABLE]

control policy class $\pi_{\theta}(s)=\theta$ , current parameters $\theta=0$ , and action noise $p_{\xi}$ . Note that

[TABLE]

where $\xi_{t}\sim p_{\xi}(\xi)$ i.i.d., so

[TABLE]

where $\xi=\xi_{0}$ is the action noise on the first step. Note that

[TABLE]

and

[TABLE]

In particular, note that

[TABLE]

Also, note that $\nabla_{\theta}J(\theta)=\beta^{T-2}$ . Therefore, we have

[TABLE]

Thus, for i.i.d. samples $\xi^{(1)},...,\xi^{(n)}\sim p_{\xi}(\xi)$ , we have

[TABLE]

Note that for $p_{\xi}(\xi)$ satisfying our conditions (differentiable on $\mathbb{R}$ and satisfying $\lim_{\xi\to\pm\infty}\xi\cdot p_{\xi}(\xi)=0$ ), we have

[TABLE]

where the second-to-last step follows from integration by parts. Thus, by the definition of the sample complexity,

[TABLE]

for any $n<n_{\xi}(\epsilon,\delta)$ , so we have

[TABLE]

for any $n<n_{\xi}(\epsilon/\beta^{T-2},\delta)$ . Thus, we have

[TABLE]

Next, consider the case where $p_{\xi}(\xi)=\mathcal{N}(\xi\mid 0,\sigma^{2})$ , for any $\sigma\in\mathbb{R}_{+}$ . Then, we have

[TABLE]

so

[TABLE]

where $x^{(i)}\sim\mathcal{N}(0,1)$ are i.i.d. standard Gaussian random variables for $i\in[n]$ . By Lemma H.8, letting $x=n^{-1}\sum_{i=1}^{n}(x^{(i)})^{2}$ (so $\mu_{x}=\mathbb{E}_{p(x)}[x]=1$ ), for

[TABLE]

we have

[TABLE]

Thus, the sample complexity of $\hat{D}_{\text{PG}}-\nabla_{\theta}J(\theta)$ satisfies

[TABLE]

Note that the numerator is positive as long as $\delta\leq 1/12$ . The claim follows, as does the theorem statement. ∎

Appendix C Proof of Theorem 4.11

Preliminaries.

Note that the expected cumulative reward is equivalent to

[TABLE]

Similarly, given a sample $\vec{\zeta}\sim p(\vec{\zeta})$ , the stochastic approximation of the expected cumulative reward is

[TABLE]

The finite difference approximation of $\nabla_{\theta}J(\theta)$ is

[TABLE]

where $\nu^{(k)}$ is a basis vector for $k\in[d_{\Theta}]$ and $d_{\Theta}$ is the dimension of the parameter space $\Theta=\mathbb{R}^{d_{\Theta}}$ . Finally, an estimate of the finite difference approximation for two samples $\zeta,\eta\sim\tilde{p}(\zeta)$ is

[TABLE]

where $\hat{J}(\theta;\vec{\zeta})$ is as defined in the proof of Theorem 4.6.

Bounding the deviation of $\hat{V}_{\theta}^{(t)}$ from $V_{\theta}^{(t)}$ .

We claim that for $t\in\{0,1,...,T\}$ , we have

[TABLE]

for all $\theta\in\Theta$ and $s\in S$ , where

[TABLE]

where $L_{V}^{(t)}$ is a Lipschitz constant for $V_{\theta}^{(t)}$ . The base case $t=T$ follows trivially. Note that $\sigma_{\zeta}\sqrt{d_{A}}\geq\sqrt{\mathbb{E}_{p(\zeta)}[\|\zeta\|^{2}]}\geq\mathbb{E}_{p(\zeta)}[\|\zeta\|]$ . Then, for $t\in\{0,1,...,T-1\}$ , we have

[TABLE]

The claim follows.

Bounding the deviation of $\hat{D}_{\text{FD}}$ from $D_{\text{FD}}$ .

Let

[TABLE]

Then, letting $L_{\nabla V}=\max_{t\in\{0,1,...,T\}}L_{\nabla V}^{(t)}$ , note that

[TABLE]

where $E=T^{-1}\sum_{t=0}^{T-1}\|\zeta_{t}\|$ . Thus, we have

[TABLE]

for $k\in[d_{\Theta}]$ , where $\tilde{E}=T^{-1}\sum_{t=0}^{T-1}\|\eta_{t}\|$ .

Upper bound on the sample complexity of $\hat{D}_{\text{FD}}-D_{\text{FD}}$ .

Note that $E+\tilde{E}\leq\|E^{\prime}\|_{1}$ , where $E^{\prime}=\vec{\zeta}\circ\vec{\eta}$ is the length $2Td_{S}$ concatenation of the vectors $\zeta_{0},\zeta_{1},...,\zeta_{T-1},\eta_{0},\eta_{1},...,\eta_{T-1}$ , so $E^{\prime}$ is $\sigma_{\zeta}$ -sub-Gaussian. We apply Lemma G.7 with

[TABLE]

Thus, $Y$ is $\sigma_{\text{FD}}$ -sub-Gaussian, where

[TABLE]

Thus, by Lemma G.6, for $k\in[d_{\Theta}]$ , the sample complexity of $\hat{D}_{\text{FD}}(\theta)_{k}-D_{\text{FD}}(\theta)_{k}$ is

[TABLE]

Upper bound on the sample complexity of $\hat{D}_{\text{FD}}-\nabla_{\theta}J(\theta)$ .

By Theorem 3.3, we have

[TABLE]

where

[TABLE]

where the second inequality follows from the fact that $L_{\nabla J}=L_{\nabla V}^{(0)}$ and the bound on $L_{\nabla V}^{(0)}$ in Lemma D.2. Now, taking

[TABLE]

then with probability $1-\delta$ , we have

[TABLE]

so the sample complexity of $\hat{D}_{\text{FD}}(\theta)-\nabla_{\theta}J(\theta)$ is

[TABLE]

The claim follows.

Lower bound on the sample complexity of $\hat{D}_{\text{FD}}-\nabla_{\theta}J(\theta)$ .

Consider a linear dynamical system with $S=\mathbb{R}^{2}$ , $A=\mathbb{R}$ , time-varying deterministic transitions

[TABLE]

time-varying noise

[TABLE]

where $\sigma_{\zeta}\in\mathbb{R}$ , initial state $s_{0}=(0,0)$ , time-varying rewards

[TABLE]

where $\phi:\mathbb{R}\to\mathbb{R}$ is defined by

[TABLE]

control policy class $\pi_{\theta}((s,s^{\prime}))=\theta$ , and current parameters $\theta=0$ . Note that technically, $R$ is not twice continuously differentiable, so it does not satisfy Assumption 4.2. However, the only place in the proof of Theorem 4.11 where we need this assumption is to apply Lemma F.2 in Lemma D.2. By the discussion in the proof of Lemma F.2, the lemma still applies, so our theorems still apply to this dynamical system. Now, we have

[TABLE]

where $\zeta=\zeta_{0}$ is the noise on the first step. Thus, we have

[TABLE]

Also, note that

[TABLE]

so $\nabla_{\theta}J(0)=0$ , since $\phi^{\prime}(0)=0$ .

Next, note that for $2n$ i.i.d. samples $\zeta^{(1)},...,\zeta^{(n)},\eta^{(1)},...,\eta^{(n)}\sim\mathcal{N}(0,\sigma_{\zeta}^{2})$ , we have

[TABLE]

Letting $\zeta^{(n+i)}=-\eta^{(i)}$ for $i\in[n]$ , and using the fact that $\phi(-x)=-\phi(x)$ , we have

[TABLE]

where

[TABLE]

Thus, by Lemma G.8, for

[TABLE]

and recalling that $D_{\text{FD}}(\theta)=\mathbb{E}_{p_{\theta}(\alpha)}[\hat{D}_{\text{FD}}(\theta;\alpha)]=\mu_{\text{FD}}$ , we have

[TABLE]

Thus, the sample complexity of $\hat{D}_{\text{FD}}(0)-D_{\text{FD}}(0)$ satisfies

[TABLE]

Now, recall that $\nabla_{\theta}J(0)=0$ , so

[TABLE]

Thus, using our assumption $\delta\leq 1/2$ , then we need to have $\mu_{\text{FD}}\leq\epsilon$ for $\text{Pr}\left[\hat{D}_{\text{FD}}(0)-\nabla_{\theta}J(0)\geq\epsilon\right]\leq\delta$ to hold. As a consequence, using our assumption $\epsilon\leq 1$ , we must have

[TABLE]

where the last step follows since $0\leq\phi(\beta^{T-2}\lambda)\leq 1$ implies $\phi(x)=x^{2}$ . Thus, we have $\lambda\leq\sqrt{\frac{\epsilon}{\beta^{2(T-2)}}}$ , so we have $\sigma_{\text{FD}}\geq\beta^{4(T-2)}\sigma_{\zeta}^{2}/\epsilon$ . Finally, we have

[TABLE]

so the sample complexity of $\hat{D}_{\text{FD}}(0)-\nabla_{\theta}J(\theta)$ satisfies

[TABLE]

Finally, for any $d_{\Theta}\in\mathbb{N}$ , we can consider $d_{\Theta}$ independent copies of this dynamical system. Then, estimating the gradient $\nabla_{\theta}J(\theta)$ is equivalent to estimating $\frac{dJ}{d\theta_{i}}(\theta)$ for each $i\in[d_{\Theta}]$ . Thus, we have

[TABLE]

The claim follows, as does the theorem statement. ∎

Appendix D Bounds on Lipschitz Constants

We prove bounds on the Lipschitz constants $L_{V}^{(t)}$ for $V_{\theta}^{(t)}$ , $L_{\nabla V}^{(t)}$ for $\nabla V_{\theta}^{(t)}$ , and $L_{\tilde{V}}^{(t)}$ for $\tilde{V}_{\theta}^{(t)}$ . We implicitly use the commonly known results in Appendix F throughout these proofs.

Lemma D.1.

We claim that for $t\in\{0,1,...,T\}$ , $V_{\theta}^{(t)}$ is $L_{V}^{(t)}$ -Lipschitz, where

[TABLE]

Proof.

First, we show that $V_{\theta}^{(t)}$ is $L_{V,\theta}^{(t)}$ -Lipschitz in $\theta$ and $L_{V,s}^{(t)}$ -Lipschitz in $s$ , where

[TABLE]

We prove by induction. The base case $t=T$ is trivial. Then, for $t\in\{0,1,...,T-1\}$ , note that $V_{\theta}^{(t)}$ is $(L_{V,\theta}^{(t)})^{\prime}$ -Lipschitz in $\theta$ , where

[TABLE]

Similarly, note that $V_{\theta}^{(t)}$ is $(L_{V,s}^{(t)})^{\prime}$ -Lipschitz in $s$ , where

[TABLE]

as was to be shown. Finally, note that

[TABLE]

so

[TABLE]

Thus, $V_{\theta}^{(t)}$ is $(L_{V}^{(t)})^{\prime}$ -Lipschitz, where

[TABLE]

The claim follows. ∎

Lemma D.2.

We claim that for $t\in\{0,1,...,T\}$ , $\nabla V_{\theta}^{(t)}$ is $L_{\nabla V}^{(t)}$ -Lipschitz, where

[TABLE]

Proof.

First, we show that $\nabla_{\theta}V_{\theta}^{(t)}$ is $L_{\nabla V,\theta,\theta}^{(t)}$ -Lipschitz in $\theta$ and $L_{\nabla V,\theta,s}^{(t)}$ -Lipschitz in $s$ , and that $\nabla_{s}V_{\theta}^{(t)}$ is $L_{\nabla V,\theta,s}^{(t)}$ -Lipschitz in $\theta$ and $L_{\nabla V,s,s}^{(t)}$ -Lipschitz in $s$ , where

[TABLE]

We prove by induction. The base case $t=T$ is trivial. First, for $t\in\{0,1,...,T-1\}$ , note that $\nabla_{\theta}V_{\theta}^{(t)}$ is $(L_{\nabla V,\theta,\theta}^{(t)})^{\prime}$ -Lipschitz in $\theta$ , where

[TABLE]

Second, note that $\nabla_{\theta}V_{\theta}^{(t)}$ is $(L_{\nabla V,\theta,s}^{(t)})^{\prime}$ -Lipschitz in $s$ , where

[TABLE]

Third, note that $\nabla_{s}V_{\theta}^{(t)}$ is $(L_{\nabla V,s,\theta}^{(t)})^{\prime}$ -Lipschitz in $\theta$ , where

[TABLE]

Fourth, note that $\nabla_{s}V_{\theta}^{(t)}$ is $(L_{\nabla V,s,s}^{(t)})^{\prime}$ -Lipschitz in $s$ , where

[TABLE]

as was to be shown. Finally, note that

[TABLE]

so

[TABLE]

so

[TABLE]

Thus, $\nabla V_{\theta}^{(t)}$ is $(L_{\nabla V}^{(t)})^{\prime}$ -Lipschitz, where

[TABLE]

The claim follows. ∎

Lemma D.3.

We claim that for $t\in\{0,1,...,T\}$ , $\tilde{V}_{\theta}^{(t)}$ is $L_{\tilde{V}}^{(t)}$ -Lipschitz, where

[TABLE]

Proof.

Note that $\tilde{V}_{\theta}^{(t)}$ is exactly equal to $V_{\theta}^{(t)}$ with $R_{\theta}$ replaced with $\tilde{R}_{\theta}$ and $f_{\theta}$ replaced with $\tilde{f}_{\theta}$ . Thus, the claim follows by the same argument as for Lemma D.1. ∎

Appendix E Proof of Theorem 3.3

Theorem E.1.

(Taylor’s theorem) Let $f:\mathbb{R}\to\mathbb{R}$ be an everywhere differentiable function with $L_{f^{\prime}}$ -Lipschitz derivative. Then, for any $x,\epsilon\in\mathbb{R}$ , we have

[TABLE]

where

[TABLE]

Proof.

The claim follows from Theorem 5.15 in Rudin et al. (1976), together with Lemma F.2, which implies that $|f^{\prime\prime}(x)|\leq L_{f^{\prime}}$ for all $x\in\mathbb{R}$ . ∎

Now, we prove Theorem 3.3. By Taylor’s theorem, we have

[TABLE]

where

[TABLE]

Thus, we have

[TABLE]

Therefore, we have

[TABLE]

so

[TABLE]

as claimed. ∎

Appendix F Technical Lemmas (Lipschitz Constants)

We define Lipschitz continuity (for the $L_{2}$ norm), and prove a number of standard results.

Definition F.1.

A function $f:\mathcal{X}\to\mathcal{Y}$ (where $\mathcal{X}\subseteq\mathbb{R}^{d}$ and $\mathcal{Y}\subseteq\mathbb{R}^{d^{\prime}}$ ) is $L_{f}$ -Lipschitz continuous if for all $x,x^{\prime}\in\mathcal{X}$ ,

[TABLE]

If $\mathcal{X}$ is a space of matrices or tensors, we assume $x$ and $x^{\prime}$ are unrolled into vectors in (3).

Lemma F.2.

If $f:\mathcal{X}\to\mathcal{Y}$ is $L_{f}$ -Lipschitz and continuously differentiable, then for all $x\in\mathcal{X}$ ,

[TABLE]

Proof.

Note that

[TABLE]

so

[TABLE]

Lemma G.3.

(Hoeffding’s inequality) Let $x_{1},...,x_{n}\sim p_{X}(x)$ be i.i.d. $\sigma_{X}$ -sub-Gaussian random variables over $\mathbb{R}$ . Then,

[TABLE]

Proof.

See Proposition 2.1 of Wainwright (2019). ∎

Definition G.4.

A random vector $X$ over $\mathbb{R}^{d}$ is $\sigma_{X}$ -sub-Gaussian if each $X_{i}$ is $\sigma_{X}$ -sub-Gaussian.

Lemma G.5.

If a random vector $X$ over $\mathbb{R}^{d}$ is $\sigma_{X}$ -sub-Gaussian, then $\mathbb{E}[\|X\|]\leq\sigma_{X}\sqrt{d}$ .

Proof.

Note that

[TABLE]

where the first inequality follows from Jensen’s inequality. ∎
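Lemma G.5 can be sanity-checked by Monte Carlo. The sketch below assumes that a coordinate distributed as $\mathcal{N}(0,\sigma^{2})$ counts as $\sigma$ -sub-Gaussian (true under the usual moment-generating-function definition) and compares the empirical mean norm with $\sigma\sqrt{d}$ . The helper `mean_norm_gaussian` is a hypothetical name.

```python
import numpy as np

def mean_norm_gaussian(d, sigma, n_samples, rng):
    """Monte Carlo estimate of E||X|| for X with i.i.d. N(0, sigma^2) coordinates."""
    X = rng.normal(0.0, sigma, size=(n_samples, d))
    return np.linalg.norm(X, axis=1).mean()

rng = np.random.default_rng(0)
sigma = 0.5
for d in (1, 4, 16, 64):
    estimate = mean_norm_gaussian(d, sigma, 50_000, rng)
    bound = sigma * np.sqrt(d)
    print(f"d={d:3d}  E||X|| ~ {estimate:.4f}  bound = {bound:.4f}")
```

The gap between the estimate and the bound is the Jensen gap from the proof; for Gaussian coordinates it shrinks in relative terms as $d$ grows.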

Lemma G.6.

Let $X$ be a random vector over $\mathbb{R}^{d}$ with mean $\mu_{X}=\mathbb{E}[X]$ , such that $X-\mu_{X}$ is $\sigma_{X}$ -sub-Gaussian. Then, given $\epsilon,\delta\in\mathbb{R}_{+}$ , the sample complexity of $X$ satisfies

[TABLE]

i.e., given $x_{1},...,x_{n}\sim p_{X}(x)$ i.i.d. samples of $X$ with empirical mean $x=n^{-1}\sum_{i=1}^{n}x_{i}$ , then $\text{Pr}[\|x-\mu_{X}\|\geq\epsilon]\leq\delta$ .

Proof.

Note that

[TABLE]

as claimed. ∎

Lemma G.7.

Let $X$ be a $\sigma_{X}$ -sub-Gaussian random vector over $\mathbb{R}^{d}$ , and let $Y$ be a random vector over $\mathbb{R}^{d^{\prime}}$ satisfying

[TABLE]

where $A,B\in\mathbb{R}_{+}$ . Then $Y$ is $\sigma_{Y}$ -sub-Gaussian, where

[TABLE]

Proof.

We first prove that $|Y_{i}|$ is bounded for each $i\in[d]$ , and then use this fact to prove that $Y_{i}$ is sub-Gaussian. In particular, we claim that for any $i\in[d]$ and any $t\in\mathbb{R}_{+}$ , we have

[TABLE]

where

[TABLE]

To this end, note that by Theorem 5.1 in Lattimore and Szepesvári (2018), for any $i\in[d]$ and any $t\in\mathbb{R}_{+}$ , we have

[TABLE]

Now, note that

[TABLE]

We consider three cases. First, suppose that $t\geq\max\{4A\sigma_{X}d\sqrt{\log d},2B\}$ . Then, $(t-B)^{2}\geq(t/2)^{2}$ , so

[TABLE]

Furthermore, $t^{2}-(Ad\sigma_{X}\sqrt{8})^{2}\log d\geq(t^{2}/2)$ , so

[TABLE]

Second, if $t\leq 2B$ , then

[TABLE]

so

[TABLE]

Third, if $t\leq 4A\sigma_{X}d\sqrt{\log d}$ , then

[TABLE]

so

[TABLE]

As a consequence, by Note 5.4.2 in Lattimore and Szepesvári (2018), $Y_{i}$ is $\tilde{\sigma}_{Y}\sqrt{5}$ -sub-Gaussian. Note that $\sigma_{Y}\geq\tilde{\sigma}_{Y}\sqrt{5}$ , so the lemma follows. ∎

Lemma G.8.

Given $\sigma\in\mathbb{R}_{+}$ ,

[TABLE]

Proof.

By Theorem 2 in Chang et al. (2011), we have

[TABLE]

where $\Phi(t)$ is the cumulative distribution function of $\mathcal{N}(0,1)$ . Thus, for $\epsilon\in\mathbb{R}_{+}$ , we have

[TABLE]

The claim follows. ∎

Appendix H Technical Lemmas (Sub-Exponential Random Variables)

We define sub-exponential random variables, and prove a number of standard results. Additionally, we prove Lemma H.7 (an analog of Lemma G.7), a key lemma that enables us to infer a sub-exponential constant for a random variable $Y$ bounded in norm by a sub-exponential random variable $X$ , i.e., $\|Y\|\leq A\|X\|_{1}+B$ (where $\|\cdot\|$ is the $L_{2}$ norm). This lemma is a key step in the proof of our upper bound in Theorem 4.7. Finally, we also prove Lemma H.8, which is a key step in the proof of our lower bound in Theorem 4.7.

Definition H.1.

A random variable $X$ over $\mathbb{R}$ is $(\tau_{X},b_{X})$ -sub-exponential if $\mathbb{E}[X]=0$ , and for all $t\in\mathbb{R}$ satisfying $|t|\leq b_{X}^{-1}$ , we have $\mathbb{E}[e^{tX}]\leq e^{\tau_{X}^{2}t^{2}/2}$ .
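Definition H.1 can be checked numerically on a standard example. For $Z\sim\mathcal{N}(0,1)$ , the centered variable $X=Z^{2}-1$ has the closed-form moment generating function $\mathbb{E}[e^{tX}]=e^{-t}/\sqrt{1-2t}$ for $t<1/2$ . The sketch below verifies on a grid that this is bounded by $e^{\tau^{2}t^{2}/2}$ for $|t|\leq 1/b$ with $(\tau,b)=(2,4)$ ; these constants are chosen for illustration and are not taken from the paper.

```python
import math

def mgf_centered_chi_square(t):
    """E[exp(t * (Z^2 - 1))] for Z ~ N(0, 1); finite only for t < 1/2."""
    return math.exp(-t) / math.sqrt(1.0 - 2.0 * t)

tau, b = 2.0, 4.0
for i in range(-100, 101):
    t = i / 100 / b                    # grid over |t| <= 1/b
    # Small tolerance covers the equality case at t = 0.
    assert mgf_centered_chi_square(t) <= math.exp(tau**2 * t**2 / 2) + 1e-12
print("bound holds on the grid for (tau, b) = (2, 4)")
```

The restriction $|t|\leq b^{-1}$ matters here: the moment generating function diverges as $t\to 1/2$ , so no bound of this form can hold for all $t$ , which is what separates this variable from a sub-Gaussian one.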

Lemma H.2.

Let $x_{1},...,x_{n}\sim p_{X}(x)$ be i.i.d. $(\tau_{X},b_{X})$ -sub-exponential random variables over $\mathbb{R}$ . Then, we have

[TABLE]

Proof.

See (2.20) in Wainwright (2019). ∎

Definition H.3.

A random vector $X$ over $\mathbb{R}^{d}$ is $(\tau_{X},b_{X})$ -sub-exponential if each $X_{i}$ is $(\tau_{X},b_{X})$ -sub-exponential.

Lemma H.4.

Let $X$ be a random vector over $\mathbb{R}^{d}$ with mean $\mu_{X}=\mathbb{E}[X]$ , such that $X-\mu_{X}$ is $(\tau_{X},b_{X})$ -sub-exponential. Then, given $\epsilon,\delta\in\mathbb{R}_{+}$ such that $\epsilon\leq d\tau_{X}^{2}/b_{X}$ , the sample complexity of $X$ satisfies

[TABLE]

i.e., given $x_{1},...,x_{n}\sim p_{X}(x)$ i.i.d. samples of $X$ with empirical mean $x=n^{-1}\sum_{i=1}^{n}x_{i}$ , then $\text{Pr}[\|x-\mu_{X}\|\geq\epsilon]\leq\delta$ .

Proof.

Note that

[TABLE]

as claimed. ∎

Lemma H.5.

Let $X$ be $\sigma_{X}$ -sub-Gaussian. Then, $X^{2}$ is $(\tau_{X},b_{X})$ -sub-exponential, where $\tau_{X},b_{X}=O(\sigma_{X}^{2})$ .

Proof.

The result follows from Lemma 5.5, Lemma 5.14, and the discussion preceding Definition 5.13 in Vershynin (2010). In particular, using the notation in Vershynin (2010), by Lemma 5.5, we have that $X$ satisfies $\|X\|_{\psi_{2}}=O(\sigma_{X})$ . Then, by Lemma 5.14, we have that $\|X^{2}\|_{\psi_{1}}=2\|X\|_{\psi_{2}}^{2}=O(\sigma_{X}^{2})$ . Finally, by the discussion preceding Definition 5.13, we have that $X^{2}$ is $(\tau_{X},b_{X})$ -sub-exponential with parameters $\tau_{X},b_{X}=O(\|X^{2}\|_{\psi_{1}})=O(\sigma_{X}^{2})$ . The claim follows. ∎

Lemma H.6.

Let $X$ and $Y$ be $\sigma_{X}$ -sub-Gaussian. Then, $Z=XY$ is $(\tau_{Z},b_{Z})$ -sub-exponential, where $\tau_{Z},b_{Z}=O(\sigma_{X}^{2})$ .

Proof.

Note that

[TABLE]

By Lemma H.5, we have $(X+Y)^{2}$ and $(X-Y)^{2}$ are $(\tau,b)$ -sub-exponential for $\tau,b=O(\sigma_{X}^{2})$ , so $Z$ is $(\tau_{Z},b_{Z})$ -sub-exponential, for $\tau_{Z},b_{Z}=O(\tau+b)=O(\sigma_{X}^{2})$ , as claimed. ∎

Lemma H.7.

Let $X$ be a $(\tau_{X},b_{X})$ -sub-exponential random vector over $\mathbb{R}^{d}$ , and let $Y$ be a random vector over $\mathbb{R}^{d^{\prime}}$ satisfying

[TABLE]

where $A,B\in\mathbb{R}_{+}$ . Then $Y$ is $(\tau_{Y},b_{Y})$ -sub-exponential, where $\tau_{Y},b_{Y}=O(A(\tau_{X}+b_{X})d\log d+B)$ .

Proof.

We use Lemma 5.14 and the discussion preceding Definition 5.13 in Vershynin (2010). In particular, let $\tilde{\tau}_{X}=\max\{\tau_{X},b_{X}\}$ ; then, from the definition of sub-exponential random variables with $t=\tilde{\tau}_{X}^{-1}$ , we have

[TABLE]

for each $i\in[d]$ . Thus, using the notation in Vershynin (2010), by the discussion preceding Definition 5.13, $X_{i}$ satisfies $\|X_{i}\|_{\psi_{1}}=O(\tilde{\tau}_{X})$ , and furthermore satisfies

[TABLE]

for all $t\in\mathbb{R}_{+}$ , where $K=O(\|X_{i}\|_{\psi_{1}})=O(\tilde{\tau}_{X})$ . Thus, for each $i\in[d]$ , we have

[TABLE]

Now, let

[TABLE]

We consider three cases. First, suppose that $t\geq\max\{4AKd\log d,2B\}$ . Then, $t-B\geq t/2$ , so

[TABLE]

Furthermore, $t-2AKd\log d\geq t/2$ , so

[TABLE]

Second, if $t\leq 2B$ , then

[TABLE]

so

[TABLE]

Third, if $t\leq 4AKd\log d$ , then

[TABLE]

so

[TABLE]

As a consequence, by the discussion preceding Definition 5.13 in Vershynin (2010), we have $Y_{i}$ satisfies $\|Y_{i}\|_{\psi_{1}}=O(\tilde{\tau}_{Y})$ . Thus, by Lemma 5.15 in Vershynin (2010), we have that $Y_{i}$ is $(\tau_{Y},b_{Y})$ -sub-exponential, where

[TABLE]

The claim follows. ∎

Lemma H.8.

Given $\sigma\in\mathbb{R}_{+}$ , let

[TABLE]

where $x^{(1)},...,x^{(n)}\sim\mathcal{N}(0,\sigma^{2})$ i.i.d., and let $\mu_{x}=\mathbb{E}_{p(x)}[x]=\sigma^{2}$ . Then, we have

[TABLE]

Proof.

Let $z=(z^{(1)})^{2}+...+(z^{(n)})^{2}$ be the sum of the squares of $n$ i.i.d. standard Gaussian random variables $z^{(1)},...,z^{(n)}\sim\mathcal{N}(0,1)$ . We assume that $n=2k$ is even. Then, $z$ is distributed according to the $\chi_{2k}^{2}$ distribution, which has density function

[TABLE]

and mean $\mu_{2k}=2k$ . For $z\geq\mu_{2k}=2k$ , we have

[TABLE]

where the second inequality follows from a result

[TABLE]

based on Stirling’s approximation (Robbins, 1955). Thus, for any $\epsilon\in\mathbb{R}_{+}$ , we have

[TABLE]

Finally, for $x=((x^{(1)})^{2}+...+(x^{(n)})^{2})/n$ , where $x^{(1)},...,x^{(n)}\sim\mathcal{N}(0,\sigma^{2})$ i.i.d., note that $x=\frac{\sigma^{2}z}{n}$ and

[TABLE]

so we have

[TABLE]

The claim follows. ∎

Appendix I Experimental Results

We show enlarged versions of the plots from Figure 1:

Model-Based Algorithm

Finite-Differences Algorithm

Policy Gradient Theorem Algorithm

Bibliography

Abbeel et al. (2007) Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems, pages 1–8, 2007.
Akametalu et al. (2014) Anayo K Akametalu, Shahab Kaynama, Jaime F Fisac, Melanie Nicole Zeilinger, Jeremy H Gillula, and Claire J Tomlin. Reachability-based safe learning with Gaussian processes. In CDC, pages 1424–1431. Citeseer, 2014.
Andrychowicz et al. (2018) Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
Antos et al. (2008) András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
Bottou and Bousquet (2008) Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Chang et al. (2011) Seok-Ho Chang, Pamela C Cosman, and Laurence B Milstein. Chernoff-type bounds for the Gaussian error function. IEEE Transactions on Communications, 59(11):2939–2944, 2011.
