Why Target Networks Stabilise Temporal Difference Methods
Mattie Fellows, Matthew J. A. Smith, Shimon Whiteson

TL;DR
This paper provides a theoretical explanation for why target networks stabilize temporal difference learning in deep reinforcement learning, showing they mitigate poor conditioning and can guarantee convergence under certain conditions.
Contribution
It formalizes the concept of partially fitted policy evaluation, characterizes the deadly triad, and explains how target networks improve stability and convergence in complex RL settings.
Findings
Target networks mitigate poor conditioning in TD updates.
Proper tuning of target network update frequency ensures convergence.
The framework bridges fitted methods and semigradient TD algorithms.
| Parameter | Value |
| Environment Parameters | |
| 0.99 | |
| Architecture Parameters | |
| MLP Hidden Layers | 2 |
| Hidden Layer Size | 32 |
| Nonlinearity | ReLU |
| 0.05 | |
| Training Parameters | |
| Total Target Network Updates | 500 |
| Learning Rate | [0.001, 0.0005] |
| Momentum | [0, 0.01] |
| Batch Size | 500 |
| Steps per Target Network Update | 5 |
| Data Gathering Steps per Update | 5 |
| Replay Buffer Size | 2500 |
Abstract
Integral to many recent successes in deep reinforcement learning has been a class of temporal difference methods that use infrequently updated target values for policy evaluation in a Markov Decision Process. At the same time, a complete theoretical explanation for the effectiveness of target networks remains elusive. In this work, we provide an analysis of this popular class of algorithms, to finally answer the question: “why do target networks stabilise TD learning”? To do so, we formalise the notion of a partially fitted policy evaluation method, which describes the use of target networks and bridges the gap between fitted methods and semigradient temporal difference algorithms. Using this framework we are able to uniquely characterise the so-called deadly triad–the use of TD updates with (nonlinear) function approximation and off-policy data–which often leads to nonconvergent algorithms. This insight leads us to conclude that the use of target networks can mitigate the effects of poor conditioning in the Jacobian of the TD update. Furthermore, we show that under mild regularity conditions and a well tuned target network update frequency, convergence can be guaranteed even in the extremely challenging off-policy sampling and nonlinear function approximation setting.
Machine Learning, ICML
1 Introduction
Since their introduction in deep Q-networks (DQN) a decade ago (Mnih et al., 2013, 2015), target networks have become a common feature of state-of-the-art deep reinforcement learning algorithms (Lillicrap et al., 2016; Haarnoja et al., 2017, 2018; Fujimoto et al., 2018). Theoretical analysis of target networks has been limited and there has been no satisfactory explanation for their empirical success in stabilising policy evaluation algorithms. Whilst recent analysis has characterised the convergence properties of policy evaluation using target networks (Lee & He, 2019; Fan et al., 2020; Zhang et al., 2021), existing approaches focus on asymptotic results, and usually make simplifying assumptions that neither hold in practice nor account for the true behaviour of target network-based updates. Our work finds that the use of target networks can guarantee that deep RL algorithms will not diverge, even in regimes where traditional RL algorithms fail. Additionally, we establish the first finite-time performance bounds for target networks and general function approximation without strong simplifying assumptions. Moreover, we prove our key stability assumption can always be satisfied by augmenting our updates with simple regularisation that does not change the TD fixed points. In doing so, we finally provide theoretical justification for the empirical success that has been observed in challenging, off-policy tasks.
To achieve this, we analyse the use of infrequently updated target value functions by characterising them as a family of methods that we refer to as partially fitted policy evaluation (PFPE). This variant bridges the gap between fitted policy evaluation (FPE) (Le et al., 2019), which iteratively fits the Bellman backups onto the class of representable function approximators, and classic temporal difference (TD) algorithms (Sutton, 1988) by limiting the fitting phase to a fixed number of steps, precisely reflecting the periodically updated target network algorithms as used in practice.
To characterise the performance of PFPE, we express our algorithm–which has traditionally been viewed through the lens of two-timescale analysis–using a single update applied only to the target network parameters. We show that the stability of the algorithm is determined by analysing the eigenvalues of the Jacobian of this update. This formulation allows us to characterise both the limiting (asymptotic) and finite-time (non-asymptotic) convergence properties of PFPE. Furthermore, it suggests, counterintuitively, that target networks are actually the object being optimised rather than merely a means to stabilise conventional TD updates. This insight leads us to empirically investigate a novel target parameter update scheme that uses a momentum-style update (Polyak, 1964), setting the stage for future research of practical target-based algorithms.
Our bounds on the finite-time performance of PFPE apply to off-policy, nonlinear and partially fitted methods, which have never been investigated previously. We develop key insights into the usefulness of target networks, which we find do not improve asymptotic performance when decaying step sizes are used. Instead, target networks improve the conditioning of TD and fitted methods when the step size does not tend to zero, as is often implemented in practice. Under non-decaying stepsizes, our Jacobian analysis shows how PFPE reconditions the TD Jacobian allowing us to prove convergence in regimes where classic TD methods are unstable, thereby breaking the so-called deadly triad that has plagued TD methods (Sutton & Barto, 2018). Furthermore, our results do not depend on unwieldy assumptions or modifications of algorithms used in practice, such as projection, bounded state spaces, linear function approximation, or iterate averaging, as is done in previous analysis. In addition to our theoretical results, we experimentally evaluate our bounds on a toy domain, indicating that they are tight under relevant hyperparameter regimes. Taken together, our results lead to novel insight as to how exactly target networks affect optimisation, and when and why they are effective, leading to actionable results that can be used to further future research.
2 Preliminaries
Proofs for all theorems, propositions and corollaries can be found in Appendix B.
We denote the set of all probability distributions on a set as . We use to denote the -norm. For a matrix , we denote the set of eigenvalues as with the set of maximum normed eigenvalues as and . The -norm (spectral norm) for matrix is . Given a function and a distribution , we denote the -norm as: .
2.1 Reinforcement Learning
We consider the infinite horizon discounted RL setting. The agent interacts with an environment, formalised as a Markov Decision Process (MDP): with state space , action space , transition kernel , initial state distribution , bounded stochastic reward kernel where and scalar discount factor . An agent in state taking action observes a reward . The agent’s behaviour is determined by a policy that maps a state to a distribution over actions: and the agent transitions to a new state . We denote the joint distribution of conditioned on for policy as . We seek to optimise (in the control case), or estimate (in the policy evaluation case) the expected discounted sum of future rewards starting from a given state . This quantity is given by the state value function, , with , the action value function, given recursively through the Bellman equation: , where the Bellman operator projects functions forwards by one step through the dynamics of the MDP:
[TABLE]
is a -contractive mapping and thus has a fixed point, which corresponds to the true value of (Puterman, 2014). When estimating MDP values, we employ a value function approximation parametrised by .
Many RL algorithms employ TD learning for policy evaluation, which combines bootstrapping, state samples and sampled rewards to estimate the expectation in the Bellman operator (Sutton, 1988). In their simplest form, TD methods update the function approximation parameters according to:
[TABLE]
where , is a sampling distribution, and is a sampling policy that may be different from the target policy . For simplicity of notation and to accommodate the introduction of target networks in Section 3, we define the tuple with distribution and the TD-error vector as:
[TABLE]
allowing us to write the TD parameter update as:
[TABLE]
We make the following i.i.d. assumption for clarity of exposition, but discuss other sampling regimes in Appendix D:
Assumption 1.
Each is drawn i.i.d.
Typically, is the steady-state distribution of an ergodic Markov chain. We denote the expected TD-error vector as: and define the set of TD fixed points as:
[TABLE]
If a TD algorithm converges, it converges to a TD fixed point. Convergence of TD methods can only be guaranteed for linear function approximators when sampling on-policy in an ergodic MDP, that is, when the agent's sampling and target distributions are the same. We investigate the phenomenon further as part of our asymptotic analysis in Section 4.1.
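To make the expected TD-error vector and the TD fixed point concrete, the following minimal sketch works through the linear, on-policy case. The three-state chain `P`, its stationary distribution `d`, the feature matrix `Phi`, the reward vector `r`, the discount and the step size are all illustrative assumptions rather than values from the paper. It uses state values instead of action values for brevity, and the expected (noise-free) update in place of sampled transitions.

```python
import numpy as np

# Hypothetical 3-state ergodic chain under the target policy (assumption).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
d = np.array([0.25, 0.50, 0.25])      # stationary distribution of P
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])          # linear features, one row per state
r = np.array([0.0, 0.0, 1.0])         # expected reward in each state
gamma, alpha = 0.9, 0.5

D = np.diag(d)


def expected_td_error(theta):
    """Expected TD-error vector E[(r + gamma*V(s') - V(s)) * grad V(s)]."""
    values = Phi @ theta
    return Phi.T @ D @ (r + gamma * P @ values - values)


# Semi-gradient TD with the sampling noise averaged out.
theta = np.zeros(2)
for _ in range(1000):
    theta = theta + alpha * expected_td_error(theta)

# For linear features the fixed point solves a linear system A theta = b.
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
b = Phi.T @ D @ r
theta_star = np.linalg.solve(A, b)

print("TD iterate     :", theta)
print("TD fixed point :", theta_star)
```

Because the data here are on-policy and the features are linear, the iterates settle on the point where the expected TD-error vector vanishes, which is the sense in which a convergent TD algorithm converges to a TD fixed point.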
3 Partially Fitted Policy Evaluation
Unfortunately, real-world applications of RL often demand the expressiveness of nonlinear function approximators like neural networks and/or the ability to use data that has been collected off-policy, i.e., by following a policy that differs from the target policy for policy evaluation.
3.1 Fitted vs. Partially Fitted Policy Evaluation
Fitted methods improve on the sample efficiency and stability of TD methods by explicitly incorporating the limitations of the function approximation class through the use of a projection operator (Tsitsiklis & Van Roy, 1997). These methods generally perform some variant of the iterate where is the projection operator . These updates are known as fitted policy evaluation (FPE).
The projection step is needed to accommodate the fact that values generally cannot be exactly represented with function approximation. To obtain a practical way of carrying out the FPE updates, a separate set of target parameters can be introduced that parameterise the TD target and are updated every timesteps:
[TABLE]
The function approximator update in Equation 7 carries out iterations of stochastic gradient descent (SGD) on the loss:
[TABLE]
before updating the target parameters. In the limit as , assuming convergence of SGD to a global minimum, fully fitted policy evaluation occurs by finding .
In practice is finite and only partial policy evaluation occurs before updating the target parameters, a setting we call partially fitted policy evaluation (PFPE). Without loss of generality, we assume that is deterministic with and for all , that is, stepsizes only change after updating target parameters. As the target parameters are updated to the approximator parameters every timesteps in Equation 8, it suffices to consider the target parameter update in isolation when analysing PFPE. Our goal is thus to analyse a single update for the target parameters in the canonical form:
[TABLE]
where is a set of samples from the environment with distribution and reduces the nested updates from Equation 7 into a single update for the target parameters.
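The nested structure described above (a fixed number of approximator steps against a frozen target, followed by a target update) can be sketched as a single function acting on the target parameters. The function name `pfpe_target_update`, the argument names and the one-state toy problem are hypothetical choices made for illustration, and the expected update is used instead of sampled minibatches.

```python
def pfpe_target_update(target, k, alpha, td_error):
    """One PFPE update of the target parameters.

    Runs k semi-gradient steps with the target frozen, then returns the
    fitted parameters, which become the new target.
    """
    params = target                      # approximator starts at the target
    for _ in range(k):
        params = params + alpha * td_error(params, target)
    return params


# Toy problem (assumption): one state with a self-loop, reward 1, and a
# tabular value estimate, so the true value is r / (1 - gamma).
r, gamma = 1.0, 0.9


def td_error(params, target):
    # The bootstrapped target uses the frozen target parameters.
    return r + gamma * target - params


# k = 1 recovers the ordinary TD update.
one_step = pfpe_target_update(2.0, k=1, alpha=0.5, td_error=td_error)
assert abs(one_step - (2.0 + 0.5 * (r + gamma * 2.0 - 2.0))) < 1e-12

target = 0.0
for _ in range(500):
    target = pfpe_target_update(target, k=5, alpha=0.5, td_error=td_error)
print(target)   # approaches r / (1 - gamma)
```

Viewing the inner loop as one map from old target to new target is what allows the whole algorithm to be analysed through a single update, without tracking the approximator and target parameters separately.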
3.2 Jacobian Analysis
In our analysis, we show that the stability of the expected PFPE update is determined by the conditioning of three Jacobians. We denote the Hessian of the loss as: , the Jacobian of the TD-error vector as: and define the TD Jacobian as: . Observe that . Without loss of generality, we assume that the Hessian matrix is diagonalisable because, if it is not, an arbitrarily small perturbation can make its eigenvalues distinct and the matrix therefore diagonalisable. So that these matrices exist, we require that the expected PFPE update is differentiable almost everywhere, a condition that is guaranteed by a Lipschitz assumption. We also require that the variance of the updates is bounded, motivating the following regularity assumption:
Assumption 2 (Function Approximator Regularity).
We assume that is Lipschitz in with constant : and is convex, for some .
The bounded variance assumption can easily be achieved for unbounded function approximators by truncating the TD error vector, much like the commonly used gradient clipping in gradient descent. We now introduce the path-mean Jacobians, which are the principal element of our analysis:
[TABLE]
Intuitively, a path-mean Jacobian is the average of all of the Jacobians along the line joining to . The convexity assumption in Assumption 2 ensures that the line integral joining any two points in always exists. The Lipschitz assumption in Assumption 2 is only required for Section 4 and can be weakened to any condition that ensures the path-mean Jacobians exist for the remainder of the paper.
Our analysis in Section 4 proves that stability of TD and PFPE under decaying stepsizes is determined solely by the negative definiteness of the TD path-mean Jacobian . In Section 5, we show for a non-diminishing stepsize regime that through suitable regularisation (which does not affect the TD fixed point), PFPE’s stability can be determined only by and , for which stable values exist. As is the path-mean Hessian of the loss, convergence can be guaranteed under the same mild assumptions required to prove convergence of a stochastic gradient descent algorithm to minimise . This implies that PFPE can converge under regimes where TD will not as is positive definite.
3.3 Analysis of FPE
We now showcase the power of our Jacobian analysis by writing the FPE updates exactly in terms of :
Theorem 1.
Under Assumption 2, the sequence of FPE updates satisfies:
[TABLE]
We can use Theorem 1 to determine the stability of FPE updates. If then the FPE updates are a contraction mapping and will converge to a fixed point under the Banach fixed-point theorem. We discuss the convergence of FPE under varying regularisation schemes in Section 5.1.
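For linear function approximation the FPE map and its Jacobian can be written down directly, which gives a small concrete instance of the contraction argument. The sketch below is an assumption-laden illustration rather than the paper's construction: the chain `P`, weights `d`, features `Phi` and rewards `r` are hypothetical, state values stand in for action values, and the names `H` and `C` are introduced here for the loss Hessian and the bootstrapping term.

```python
import numpy as np

# Hypothetical on-policy 3-state chain with linear features (assumption).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
d = np.array([0.25, 0.50, 0.25])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.9
sqrt_d = np.sqrt(d)


def fpe_step(theta_target):
    """Fully fit the d-weighted least-squares projection of the backup."""
    backup = r + gamma * P @ (Phi @ theta_target)        # Bellman backup
    theta, *_ = np.linalg.lstsq(sqrt_d[:, None] * Phi,
                                sqrt_d * backup, rcond=None)
    return theta


# Jacobian of the FPE map with respect to the target parameters.
H = Phi.T @ (d[:, None] * Phi)                   # Hessian of the fitting loss
C = gamma * Phi.T @ (d[:, None] * (P @ Phi))     # bootstrapping term
fpe_jacobian = np.linalg.solve(H, C)
contraction_factor = np.linalg.norm(fpe_jacobian, 2)   # spectral norm

theta = np.zeros(2)
for _ in range(300):
    theta = fpe_step(theta)

td_fixed_point = np.linalg.solve(H - C, Phi.T @ (d * r))
print("contraction factor:", contraction_factor)
print("FPE iterate       :", theta)
print("TD fixed point    :", td_fixed_point)
```

In this example the spectral norm of the Jacobian is below one, so repeated FPE updates contract, and the point they contract to coincides with the TD fixed point.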
4 Asymptotic Analysis
We now study the behaviour of Equation 10 in the limit of . We introduce the standard Robbins-Monro condition for the decaying stepsizes, which is a necessary condition to ensure convergence to a fixed point:
Assumption 3 (Robbins-Monro).
Each is a positive scalar with and .
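As a worked instance, assuming the assumption refers to the usual Robbins-Monro pair of conditions (stepsizes that are not summable but are square-summable) and writing the stepsize at update $l$ as $\alpha_l$ (notation chosen here for illustration), a harmonic schedule satisfies both conditions whereas a constant stepsize violates the second:

```latex
\alpha_l = \frac{1}{l+1}: \qquad
\sum_{l=0}^{\infty} \alpha_l = \infty, \qquad
\sum_{l=0}^{\infty} \alpha_l^2 = \frac{\pi^2}{6} < \infty;
\qquad\qquad
\alpha_l = \alpha > 0: \qquad
\sum_{l=0}^{\infty} \alpha_l^2 = \infty .
```

The constant-stepsize case is the regime treated by the non-asymptotic analysis in Section 5.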
Now we introduce a core necessary assumption to prove stability of PFPE with diminishing stepsizes:
Assumption 4 (TD Stability).
There exists a region containing a fixed point such that has strictly negative eigenvalues for all .
The key insight from Assumption 4 is that the stability of PFPE under diminishing stepsizes is determined only by the eigenvalues of the single step path-mean Jacobian , regardless of the value of or . Indeed, stochastic approximation can be shown to be provably divergent if this condition cannot be satisfied (Pemantle, 1990). From this perspective, if TD diverges then so will PFPE under diminishing stepsizes. Hence the asymptotic stability of PFPE is independent of and , and, unlike updating under a two-timescale regime, introducing target parameters that are updated periodically every timesteps does not improve asymptotic convergence properties under this analysis. Once Assumption 4 has been established, there are several approaches to prove convergence of the PFPE update under varying sampling conditions and projection assumptions. We follow the proof of Vidyasagar (2022), but discuss approaches that generalise our assumptions in Appendix D.
Theorem 2.
Let Assumptions 1 to 4 hold. If there exists some fixed point with region of contraction and timestep such that for all , then the sequence of target parameter updates in Equation 8 converges almost surely to .
4.1 The Deadly Triad
We have established that it is not possible to prove convergence of PFPE under diminishing stepsizes if Assumption 4 does not hold. We now discuss how adherence to Assumption 4 formalises a phenomenon known as the deadly triad (Sutton & Barto, 2018), where it has been established that TD cannot be proved to converge when using function approximators in the off-policy setting. To control for the effect of nonlinear function approximation, we first investigate linear function approximators of the form where is a feature vector. Define the one-step lookahead distribution as: . Introducing the shorthand:
[TABLE]
we can derive the TD Jacobian as:
[TABLE]
We now examine why the conditioning of explains this phenomenon.
Linear Function Approximation
For linear function approximators, we show in Section A.1 that for all is a sufficient condition for to have negative eigenvalues, thereby satisfying Assumption 4. This implies that the function approximator class remains non-expansive under the one-step lookahead distribution , thereby preventing the function approximator from diverging as the Markov chain is traversed. This condition has been introduced previously in the fitted Q-iteration literature (Wang et al., 2020, 2021) as a “low distribution shift” assumption.
In the on-policy setting in an ergodic MDP, we can prove that there exists a stationary distribution induced by following the target policy , that is . Moreover it is assumed that samples come from ; hence by the definition of ergodicity, the one-step lookahead distribution is the stationary distribution: . It thus follows that and hence Assumption 4 holds automatically for on-policy TD in an ergodic MDP, thereby establishing the convergence properties as a special case via Theorem 2.
For off-policy data, it is not possible to prove that holds without further assumptions on the sampling policy and MDP. In general, it is not possible to show that is negative definite in the off-policy case as the distribution shift may be too high: there exist counterexample MDPs where off-policy algorithms such as Q-learning provably diverge under linear function approximation (Williams & Baird, 1993; Baird, 1995a).
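A minimal sketch of how distribution shift makes the TD Jacobian positive is given below. It is in the spirit of the well-known two-state example discussed by Sutton & Barto (2018), not one of the counterexamples cited above, and every number in it (features, discount, step size) is an illustrative assumption. Only the transition from the first state to the second is ever sampled, so the second state's value is never corrected.

```python
# Two-state example with a single scalar feature per state (assumption).
gamma, alpha = 0.9, 0.1
phi_s, phi_next = 1.0, 2.0      # features of the sampled state and its successor
reward = 0.0

# Off-policy sampling: only the transition s -> s' appears in the data, so
# the TD Jacobian is the scalar phi_s * (gamma * phi_next - phi_s).
td_jacobian = phi_s * (gamma * phi_next - phi_s)

theta = 1.0
history = [theta]
for _ in range(50):
    td_error = reward + gamma * phi_next * theta - phi_s * theta
    theta = theta + alpha * td_error * phi_s      # semi-gradient TD step
    history.append(theta)

print("TD Jacobian:", td_jacobian)                # positive, so TD is unstable
print("theta after 50 steps:", theta)
```

Each step multiplies the parameter by `1 + alpha * td_jacobian`, which exceeds one for every positive step size whenever `gamma * phi_next > phi_s`, so shrinking the step size slows the divergence but cannot remove it.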
Nonlinear Function Approximation
Even in an on-policy regime, we cannot prove convergence of TD when nonlinear function approximators such as neural networks are used. In these cases, the path-mean Jacobian may not have a closed form solution. However, it can be bounded by the following norm (see Section A.2):
[TABLE]
Even making the same assumption as in Section 4.1 of sampling on-policy in an ergodic MDP to show that
[TABLE]
we cannot prove the negative definiteness of required to satisfy Assumption 4. This is because the matrix can be arbitrarily positive definite depending on the MDP and choice of function approximator. Indeed, there exist counterexample MDPs with provably divergent nonlinear function approximators when sampling on-policy (Tsitsiklis & Van Roy, 1997).
5 Non-asymptotic Analysis
Our asymptotic analysis in Section 4 shows that increasing or adjusting for PFPE does not affect the asymptotic strong convergence properties of the TD algorithm, implying that target networks do not stabilise TD if stepsizes tend to zero. We showed that the underlying reason for this was the deadly triad, which we formalised as adherence to Assumption 4. We now replace Assumption 4, which requires that is negative definite, with the assumption that FPE is stable:
Assumption 5 (FPE Stability).
There exists a region containing a fixed point such that .
5.1 Stabilising FPE
We now prove that Assumption 5 can always be satisfied using regularisation schemes that do not affect the TD fixed points. We introduce the following regularised TD vector:
[TABLE]
where is a regularisation term such that , thereby not changing the TD fixed point or TD update. As an example, can contain powers of regularisation terms in addition to combinations of and terms, where is a TD vector with target and Q-network parameters swapped. In this paper, we briefly study regularisation of the form:
[TABLE]
where mixes the TD updates and controls the degree of regularisation. We emphasise that , leaving the TD update unchanged. In contrast, unless is known a priori, introducing regularisation that modifies the TD update, as is done by Zhang et al. (2021), will affect the TD fixed points. We now prove that FPE can be stabilised by treating and as hyperparameters to be tuned to the specific MDP.
Proposition 1.
Using the regularised TD vector in Equation 26, the path-mean Jacobians are:
[TABLE]
Assumption 5 is satisfied if:
[TABLE]
There exists finite such that Equation 31 holds.
The key insight from Proposition 1 is that regularisation stabilises FPE (and hence PFPE) without affecting existing TD fixed points, even when TD is unstable, motivating future research directions to develop sophisticated regularisation techniques.
5.2 Convergence Analysis
By carrying out a non-asymptotic analysis, we now investigate how the deadly triad can be broken by PFPE using Equation 24 when stepsizes do not tend to zero. This leads to a formal understanding of how target parameters stabilise TD under stepsize regimes that are actually used in practice when classic TD methods fail. The foundation of our analysis is a condition function that can be used to determine the stability of the updates:
Definition 1 (Condition Function).
For a subset with corresponding fixed point such that for all , let
[TABLE]
and define the condition function as:
[TABLE]
The condition function depends on the maximal eigenvectors of the Jacobians introduced in Section 3.2, and so can still be used to analyse general nonlinear function approximators for which the path-mean Jacobians have no analytic solution. Using the condition function, we decompose the error at a given timestep into the effect of the expected update plus the error induced by variance of the update:
Theorem 3.
Define
[TABLE]
Let Assumptions 1 and 2 hold, then:
[TABLE]
The effect of the expected update (the first term in Equation 38) is bounded by the condition function, which depends not only on data conditioning but, critically, on both and as well, and must diminish with increasing to ensure convergence. Using this decomposition, we see convergence is guaranteed if the following assumption holds:
Assumption 6 (Contraction Region).
We assume that over .
allowing us to prove convergence of PFPE for stepsizes that do not tend to zero provided that updates remain in a region of contraction:
Corollary 3.1.
Let Assumptions 1, 2, 5 and 6 hold. For a fixed stepsize ,
[TABLE]
Corollary 3.1 is a key result of this work. Our result demonstrates geometric decay of errors in , to a ball of fixed radius . This is analogous to related work in stochastic gradient descent (Bottou et al., 2018), and matches the intuition that, without decaying stepsize, variance in the updates means that convergence to a fixed point does not occur. Note that the radius of the ball which we converge to can be made arbitrarily small by decreasing .
This supports the use of a hybrid approach, wherein a fixed step size is used until iterates stop improving, after which the step size is reduced and the procedure repeated, shrinking the radius of the ball of convergence whilst maintaining as small as possible. In the remainder of this section, we explore the properties of the condition function to ensure the existence of a region of contraction satisfying Assumption 6.
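The qualitative behaviour described here (geometric decay followed by a noise floor that shrinks with the step size) already appears in a scalar stand-in for the update. The sketch below is not the bound of Corollary 3.1: it assumes a hypothetical update `theta <- theta - alpha * (theta - theta_star) + alpha * noise` with zero-mean noise of variance `sigma2`, for which the expected squared error obeys an exact recursion.

```python
def expected_sq_error(alpha, sigma2, e0, steps):
    """Iterate e <- (1 - alpha)^2 * e + alpha^2 * sigma2 from e0."""
    e = e0
    for _ in range(steps):
        e = (1.0 - alpha) ** 2 * e + alpha ** 2 * sigma2
    return e


def residual_radius(alpha, sigma2):
    """Fixed point of the recursion: the squared radius of the noise ball."""
    return alpha * sigma2 / (2.0 - alpha)


for alpha in (0.5, 0.1, 0.01):
    final = expected_sq_error(alpha, sigma2=1.0, e0=100.0, steps=5000)
    print(alpha, final, residual_radius(alpha, 1.0))
```

Larger step sizes contract faster (the factor `(1 - alpha)^2` is smaller) but settle on a larger residual error, which is the trade-off behind the hybrid schedule of running with a fixed step size and then reducing it.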
5.3 Properties of PFPE Condition Function
We now investigate key properties of Equation 36 to understand how target parameters can lead to convergence when classic TD methods fail. If is positive definite, TD is provably divergent; however, our analysis reveals that there are values of and for which PFPE does converge.
Property 1: Lower bound
.
We first investigate the conditions for which our choice of function approximators can never be used to prove convergence. Our condition function implies that we cannot prove convergence for any or as repeated applications of do not reduce the effect of the ill-conditioning of . We formalise this in the following regularity assumption:
Assumption 7 (Eigenvalue Regularity Assumption).
Given a region , for all there exists and such that .
We now propose two simple fixes to avoid this issue. Recall from Section 3.2 that is an eigenvalue of the Hessian of a loss. If were negative, this would imply that the Hessian is not positive semidefinite for all in the region of interest; hence we cannot prove convergence of stochastic gradient descent on the loss , let alone the full PFPE algorithm. To remedy this problem, the eigenvalues of the matrix can be increased using the regularisation introduced in Equation 24 without affecting the TD fixed point. However, if , then the conditioning of the Hessian matrix is ill-suited to the chosen step-size, and an easy remedy is to decrease . Our bound shows that the condition function is lower bounded by , and so if Assumption 5 does not hold, then convergence of PFPE is not provable.
Property 2: Monotonicity
For , for .
The monotonicity property ensures that defines the interval of Hessian eigenvalues for which there is a regime in which we can increase in order to ensure PFPE updates are a contraction mapping. This suggests that a key role of the target network is to help mitigate the effects of the ill-conditioning of the TD Jacobian when using fixed step sizes. We now investigate how decreasing stepsizes and increasing the number of PFPE steps affect the conditioning of PFPE, which validates this hypothesis.
Property 3: Limits
For any , . For any , .
The first limit illustrates the effects of a diminishing stepsize sequence, confirming our bound is consistent with the results of the previous section that increasing does not improve the convergence properties of PFPE if stepsizes tend to zero and PFPE only stabilises TD for . By taking the limit , we complement our monotonicity result, obtaining a bound for how much we can improve on the stability of TD by increasing . As expected, in the limit of , the condition function tends to . Through this insight, we interpret PFPE as mixing FPE and TD updates according to the coefficient : for , PFPE uses only TD updates and in the limit , PFPE recovers the FPE update.
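For linear function approximation with expected (noise-free) updates, this interpolation between TD and FPE can be written out explicitly. The notation below is introduced only for this sketch and is an assumption, not the paper's: feature matrix $\Phi$, diagonal sampling weights $D$, transition matrix $P$, reward vector $r$, discount $\gamma$, $H = \Phi^\top D \Phi$ (assumed invertible), $C = \gamma\,\Phi^\top D P \Phi$, $b = \Phi^\top D r$, step size $\alpha$, $k$ inner steps, target parameters $\bar{\omega}$, and $B = I - \alpha H$.

```latex
\begin{aligned}
\omega_{j+1} &= \omega_j + \alpha\bigl(b + C\bar{\omega} - H\omega_j\bigr),
  \qquad \omega_0 = \bar{\omega},\\
\omega_k &= B^k\bar{\omega} + \bigl(I - B^k\bigr)H^{-1}\bigl(b + C\bar{\omega}\bigr),\\
\frac{\partial \omega_k}{\partial \bar{\omega}} &= B^k + \bigl(I - B^k\bigr)H^{-1}C .
\end{aligned}
```

With $k = 1$ the Jacobian is $I + \alpha(C - H)$, which is the TD update. If every eigenvalue of $B$ has modulus below one, then $B^k \to 0$ as $k$ grows and the Jacobian tends to $H^{-1}C$, which is the FPE update; intermediate $k$ mixes the two with weight $B^k$.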
5.4 Breaking the Deadly Triad
We now combine all properties presented in this section into our main result, proving that through suitable regularisation and choice of and , PFPE breaks TD’s deadly triad described in Section 4.1:
Theorem 4.
Let Assumption 7 hold over from Definition 1. For any such that , any
[TABLE]
ensures that is a region of contraction satisfying Assumption 6.
Theorem 4 demonstrates that appropriate values of and can be found by treating them as hyperparameters, decreasing and increasing until the algorithm is stable, reducing the conditions needed to prove convergence of PFPE to those of proving convergence of stochastic gradient descent on the loss . The key insight of Theorem 4 is that even when TD is unstable due to , there exists a finite such that and hence PFPE is stable. We illustrate this phenomenon with a sketch in Figure 1, demonstrating that increasing ensures PFPE is provably convergent in regimes where TD cannot be proved to converge.
The key insight of our analysis is that, unlike in TD where stability can only be proved if the matrix is negative definite, with suitable regularisation, the stability of PFPE can be determined solely by tuning and , regardless of the MDP, sampling regime, or function approximator, thereby breaking the deadly triad. The choice of and thus becomes a trade-off between maintaining a fast rate of convergence and reducing the residual variance in Equation 38.
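The hyperparameter search suggested by Theorem 4 can be sketched numerically in the linear case, where the Jacobian of the expected target update after a given number of inner steps has a closed form. Everything in the example is an assumption made for illustration: the two-state MDP, the off-policy sampling weights `d`, the features `Phi`, the step size, and the names `H`, `C` and `pfpe_jacobian`. FPE happens to be stable here without extra regularisation because the two features can represent any value function on two states.

```python
import numpy as np

# Hypothetical off-policy problem (assumption): state 0 moves to state 1,
# state 1 loops on itself, rewards are zero, and data over-samples state 0.
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
d = np.array([0.9, 0.1])              # sampling distribution, not stationary
Phi = np.array([[1.0, 0.5],
                [2.0, 0.0]])          # linear features, one row per state
gamma, alpha = 0.9, 1.0

H = Phi.T @ (d[:, None] * Phi)                    # Hessian of the fitting loss
C = gamma * Phi.T @ (d[:, None] * (P @ Phi))      # bootstrapping term
td_jacobian = C - H
B = np.eye(2) - alpha * H                         # one inner gradient step


def pfpe_jacobian(k):
    """Jacobian of the expected target update after k inner steps."""
    Bk = np.linalg.matrix_power(B, k)
    return Bk + (np.eye(2) - Bk) @ np.linalg.solve(H, C)


def spectral_radius(M):
    return max(abs(np.linalg.eigvals(M)))


print("max real part of TD Jacobian:", max(np.linalg.eigvals(td_jacobian).real))
print("spectral radius with k = 1  :", spectral_radius(pfpe_jacobian(1)))

# Increase k until the expected target update becomes stable.
k_stable = next(k for k in range(1, 10_000)
                if spectral_radius(pfpe_jacobian(k)) < 1.0)
print("smallest stabilising k      :", k_stable)
```

The TD Jacobian has an eigenvalue with positive real part, so the `k = 1` update is unstable for every positive step size, yet a finite number of inner steps brings the spectral radius of the target update below one.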
6 Related Work
Our work furthers the analysis of TD, FPE, and target-network based methods. In this section we provide a brief overview of previous investigations of these algorithms.
Fitted Policy Evaluation
FPE is a relatively well understood class of RL algorithms from a theoretical perspective. Nedić & Bertsekas (2003) analyse the convergence of the Least-Squares Policy Evaluation (LSPE) of Bertsekas & Ioffe (1996) in an on-policy, linear function approximation setting. Analysis of LSPE shows that learning with constant step size leads to theoretical and empirical gains compared to TD and LSPE with decaying step sizes (Bertsekas et al., 2004), which mirrors our conclusions in Section 5.4.
In the context of fitted methods applied to off-policy and control problems, Munos & Szepesvári (2008) prove generalisation properties of Fitted Q Iteration (Ernst et al., 2005) for general function classes under assumptions of low projection error and limited data distribution shift. Le et al. (2019) coin the term FPE, and formalise the algorithm for general function approximators, with theoretical results under similar assumptions to Munos & Szepesvári (2008).
Theory of TD
Previous results concerning convergence rates of classic TD methods largely argue that the Bellman operator is a contraction, and thus most focus on linear function approximation. Tsitsiklis & Van Roy (1997) first proved convergence of linear, on-policy TD, arguing that the projected Bellman operator in this setting is a contraction. This corresponds to a special case of Assumption 4. Dalal et al. (2017) give the first finite time bounds for linear TD(0), under an i.i.d. data model similar to the one that we use here. Bhandari et al. (2018) provide bounds for linear TD in both the i.i.d. data setting and a correlated data setting, through analogy with SGD. Srikant & Ying (2019) approach the problem from the perspective of Ordinary Differential Equations (ODE) analysis, bounding the divergence of a Lyapunov function from the limiting point of the ODE that arises from the TD update scheme.
Analysis of Target Networks
Existing analyses of the theoretical properties of target networks are limited, usually involving algorithmic changes or restrictive assumptions. Yang et al. (2019) show convergence of a Q-learning approach using a target network that is updated using Polyak averaging with nonlinear function approximation. However, their analysis, which makes use of two-timescale analysis, requires a projection step to limit the magnitude of parameters. Carvalho et al. (2020) show convergence of a related method using two-timescale analysis, though their target network update differs significantly from those used in practice. Zhang et al. (2021) analyse the use of target networks with linear function approximation, but require projection steps on both the target network and value parameters. Lee & He (2019) provide finite-iteration bounds, but are limited to on-policy data, linear function approximation, and near-perfect fitting to the target network between updates. Fan et al. (2020) analyse the use of target networks for deep Q-learning (Mnih et al., 2015) with the simplifying assumption that they are performing some form of Fitted Q Iteration.
None of these efforts yield finite time bounds with target networks, nor do any match the policy evaluation methods used in practice as well as the PFPE analysis studied here. Furthermore, our use of a single target network update, rather than independent target and value updates, leads to simpler bounds without the need for a two-timescale analysis.
GTD and TDC Methods
While not directly related to PFPE or the use of target networks, GTD-style approaches (Sutton et al., 2008, 2009; Maei et al., 2009) also lead to convergent, TD-style algorithms, even with off-policy sampling or nonlinear function approximation. These methods maintain a second set of parameters which must be optimised at a faster timescale than the value parameters. However, these approaches are commonly found to be ineffective and are not used in practice due to the difficulty in tuning the rate of the second timescale (see, e.g., Fellows et al. (2021)), and the potential additional variance introduced by the second set of parameters (Ghiassian et al., 2020).
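To make the role of the second parameter set concrete, the following is a minimal sketch of a single linear TDC step in the form given by Sutton et al. (2009), assuming no importance weighting. The function name `tdc_step`, the list-based vectors, and the step sizes `alpha` and `beta` are illustrative choices for this sketch, not notation from this paper.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def tdc_step(w, h, phi, phi_next, reward, gamma, alpha, beta):
    """One linear TDC update (sketch, no importance weights).

    w    : value parameters (slow timescale, step size alpha)
    h    : auxiliary parameters (fast timescale, step size beta)
    phi, phi_next : feature vectors of the current and next state
    """
    # Standard TD error computed with the value parameters.
    delta = reward + gamma * dot(phi_next, w) - dot(phi, w)
    # The auxiliary parameters estimate the expected TD error given phi.
    correction = dot(phi, h)
    # Semi-gradient TD step plus the gradient-correction term.
    w_new = [
        wi + alpha * (delta * p - gamma * correction * pn)
        for wi, p, pn in zip(w, phi, phi_next)
    ]
    # LMS-style update of the auxiliary parameters.
    h_new = [hi + beta * (delta - correction) * p for hi, p in zip(h, phi)]
    return w_new, h_new
```

Choosing `beta` larger than `alpha` is what places the auxiliary parameters on the faster timescale; the ratio between the two is the quantity that is reported to be hard to tune.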
Improving Conditioning of TD Methods
Previous work concerning conditioning of TD methods has been largely concerned with approximation of preconditioning approaches to iterative methods (Saad, 2003). The first such approach was focused on preconditioning of on-policy, linear, least-squares forms of TD (Yao & Liu, 2008). Chen et al. (2020) and Romoff et al. (2020) adapt this approach for nonlinear function approximation, though their results are still on-policy. Our work, on the other hand, demonstrates that use of the target network, alongside fixed step sizes, changes the form of parameter iterates to ameliorate the poor conditioning that occurs when directly applying TD or fitted methods, even in off-policy settings.
7 Experiments
We proceed to an empirical investigation of our bounds. First, we demonstrate that the use of an infrequently updated target network leads to convergence of off-policy evaluation on Baird’s notorious counterexample. Then, we evaluate the effect of a speculative modified update rule in the Cartpole-v0 “gym” environment (Brockman et al., 2016). Additional implementation details for both experiments can be found in Appendix C.
7.1 Baird’s Counterexample
In this experiment, we demonstrate the practicality of our core claim: that for a sufficiently high number of steps per target network update and a low enough step size, PFPE will not diverge, even under conditions where TD does. To do so, we evaluate the use of target networks with varying update frequencies on the well known off-policy counterexample due to Baird (1995b).
In this environment, depicted in Appendix C, rewards are zero everywhere, transitions are deterministic, and the true solution lies within the linear function approximation class that we make use of. The behaviour policy is set such that all states are sampled with uniform probability. The target policy, however, always transitions to a specific state, and remains there. Due to undersampling of this absorbing state, conventional TD policy evaluation diverges, demonstrating that even in simple environments, TD can be unstable when applied off policy with function approximation.
We report the stepwise (fitted) error in Figure 2 across different numbers of steps per target network update, for a fixed step size and a fixed discount factor. With a single step per target update, which is equivalent to using TD with fixed step sizes, our parameters diverge. Likewise, if the number of steps is set to 5 or 10, we are unable to overcome the conditioning of the TD Jacobian and diverge, albeit at a slower rate. Once the number of steps is increased further, however, conditioning has improved enough to lead to convergence. This supports our theoretical conclusion: that PFPE can be used to improve the convergence conditions of TD.
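The qualitative pattern, divergence with one step per target update and convergence with many, can be reproduced in a small linear system. The sketch below is not Baird's counterexample: the matrices `M` and `N` are hypothetical values chosen so that the one-step update matrix `N - M` has eigenvalues with positive real part while full fitting (`M^{-1} N`) is a contraction. It also assumes that the inner iterate is re-initialised at the current target parameters after every target update.

```python
# Toy 2-D linear expected update (hypothetical numbers, NOT Baird's example).
# M stands in for the feature covariance term, N for the bootstrapped term.
M = [[10.0, 0.0], [0.0, 1.0]]
N = [[15.0, -10.0], [1.0, -0.5]]  # equals M @ [[1.5, -1.0], [1.0, -0.5]]


def matvec(a, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]


def pfpe_final_norm(k, alpha=0.1, target_updates=200):
    """Run periodic target updates with k inner steps each.

    Returns the Euclidean norm of the target parameters at the end;
    the fixed point of this toy system is the zero vector.
    """
    target = [1.0, 1.0]
    for _ in range(target_updates):
        w = list(target)                # inner iterate starts at the target
        bootstrap = matvec(N, target)   # held fixed between target updates
        for _ in range(k):
            fitted = matvec(M, w)
            w = [w[i] + alpha * (bootstrap[i] - fitted[i]) for i in range(2)]
        target = w                      # periodic hard update
    return (target[0] ** 2 + target[1] ** 2) ** 0.5


if __name__ == "__main__":
    print("k=1  final norm:", pfpe_final_norm(1))
    print("k=20 final norm:", pfpe_final_norm(20))
```

With `k=1` the loop reduces to the plain fixed-step update `w <- w + alpha * (N - M) w`, which grows without bound in this toy system, whereas `k=20` drives the parameters towards the fixed point at the same step size.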
7.2 Cartpole Experiment
One important insight of our analysis is that we can view the entire optimisation process as a sequence of updates to the target network only. This suggests investigation into alternative forms or acceleration of target network updates. Inspired by the use of optimisation methods with momentum in RL settings (Sarigül & Avci, 2018; Haarnoja et al., 2018), we investigate the effects of a target network that is updated using momentum.
In contrast to the standard periodic target network update in Equation 8, we postulate that there may be settings in which a periodic update with momentum accelerates or stabilises convergence. This update works as follows:
[TABLE]
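As a rough illustration only, one way to realise a periodic target update with momentum is a heavy-ball step on the target parameters. The form below, the name `momentum_target_update`, and the coefficient `mu` are assumptions made for this sketch and may differ from the update defined above; setting `mu = 0` recovers a plain periodic copy of the value parameters into the target.

```python
def momentum_target_update(target, online, velocity, mu):
    """Heavy-ball style periodic target update (hypothetical form).

    target   : current target parameters
    online   : value parameters after the inner steps
    velocity : running momentum buffer, same length as target
    mu       : momentum coefficient; mu = 0 gives a hard copy
    """
    # Accumulate the displacement from the target to the value parameters.
    new_velocity = [mu * v + (o - t) for v, o, t in zip(velocity, online, target)]
    # Move the target along the accumulated direction.
    new_target = [t + v for t, v in zip(target, new_velocity)]
    return new_target, new_velocity
```

The function would be called once every fixed number of value-parameter steps, with `velocity` initialised to zeros.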
We investigate the effects of this momentum update on the Cartpole domain. For this experiment, we use control results in which the policy is continuously learned. This is because control problems are inherently off-policy, and induce additional instability, and thus benefit from faster and more stable convergence of values. We implement the standard DQN (Mnih et al., 2015) algorithm, with our modified target network update in order to examine its effect. The results are shown in Figure 3. Our proposed update indeed leads to improved learning and stability, at least for the hyperparameter ranges tested, suggesting that the momentum update has merit. As a result, we propose investigation of more sophisticated target network update schemes as an avenue for future research.
8 Conclusions
This work analysed the use of target networks through the formulation of a novel class of TD updates, which we refer to as PFPE. These updates generalise traditional TD(0) and fitted policy evaluation methods. Our analysis contributes asymptotic and finite time bounds without additional restrictive assumptions or significant changes to the algorithms used in practice. In our main result, we uncovered novel insight as to when and how target networks are useful: provided step-sizes don’t tend to zero and FPE is stable, there always exists a finite number of update steps and non-zero upper bound over stepsizes such that PFPE can improve conditioning to ensure learning is stable when classic TD methods fail. Our focus on the target network update as the object of concern in terms of optimisation suggests that novel, accelerated methods for updating target networks may help speed up and stabilise learning. Our initial experiments support this notion. Moreover, our analysis reveals that regularisation may be key to determining the stability of PFPE, opening a promising avenue for future research.
Acknowledgements
Mattie Fellows is funded by a generous grant from Waymo. We would like to thank Valentin Thomas for providing a helpful discussion.
Appendix A Derivations
A.1 Derivation of Assumption 4 from low distributional shift
Starting from Assumption 4 and the definition of negative definiteness, we need to show:
[TABLE]
whenever , for all . Investigating the first term by expanding the expectations we see:
[TABLE]
This allows us to apply our assumption:
[TABLE]
A.2 Nonlinear Jacobian Analysis
We start by bounding the maximum eigenvalue:
[TABLE]
We now substitute for the definition of the TD Jacobian, yielding:
[TABLE]
as required.
Appendix B Proofs
B.1 FPE Analysis
Lemma 1.
Under Assumption 2, the FPE update satisfies:
[TABLE]
Proof.
Given , the FPE fixed point must be an element of the set:
[TABLE]
which we use to derive a stability condition for the projection operator:
[TABLE]
Let and . We introduce the notation:
[TABLE]
We observe that and , and and . From the fundamental theorem of calculus and Assumption 2, it follows:
[TABLE]
as required. ∎
Theorem 1.
Under Assumption 2, the sequence of FPE updates satisfies:
[TABLE]
Proof.
From Equation 61 of Lemma 1, it follows:
[TABLE]
Recursively applying the result times, our result follows immediately. ∎
B.2 Asymptotic Analysis
For this section, we define a Martingale difference sequence that captures the behaviour of our updates. Let denote the intermediate function approximation parameters between target parameter updates and , with and . We start by writing our target parameter updates as:
[TABLE]
where we define recursively as:
[TABLE]
and remark that trivially. We write our target parameter updates as:
[TABLE]
where
[TABLE]
and defines the Martingale sequence:
[TABLE]
In this section, we demonstrate that the proof of Borkar & Meyn (2000, Theorem 2.2) can be adapted to account for the additional term that arises due to the use of target networks in the updates. Lemma 2 demonstrates that as stepsizes tend to zero, the effect of this term becomes negligible, and hence its inclusion does not change our analysis of the underlying ODE defined by the TD updates.
Lemma 2.
Let for . Under Assumptions 1 to 3, almost surely.
Proof.
We start by bounding each using the Lipschitzness of from Assumption 2:
[TABLE]
To proceed, we recognise that each almost surely where is a finite positive constant - otherwise:
[TABLE]
for at least one , hence for some thereby violating Assumption 2. Using , we bound :
[TABLE]
almost surely. We use this result to bound :
[TABLE]
Now, under Assumption 3,
[TABLE]
hence by the bound established in Equation 95:
[TABLE]
almost surely, as required. ∎
Theorem 2.
Under Assumptions 1-4, the sequence of target parameter updates in Equation 8 converges almost surely to .
Proof.
Our update
[TABLE]
is identical to the update presented in Borkar & Meyn (2000, Eq. 2.1.1) with an additional term . Proof of convergence to the ODE is given by Borkar & Meyn (2000, Lemma 1), which is predicated on the convergence of:
[TABLE]
from Borkar & Meyn (2000, Eq. 2.1.6) where
[TABLE]
for , that is , almost surely. To adapt our updates so that Borkar & Meyn (2000, Lemma 1) still applies, we recognise that the term is now replaced in our updates with:
[TABLE]
and hence is replaced in our updates with:
[TABLE]
where is defined as in Lemma 2. All arguments of Borkar & Meyn (2000, Lemma 1) remain unchanged, except Eq. 2.1.9, where we must now show that :
[TABLE]
Applying Lemma 2 yields almost surely, hence
[TABLE]
which is proved in Borkar & Meyn (2000, Lemma 1). Convergence of our algorithm is thus only predicated on the convergence of the update:
[TABLE]
Borkar & Meyn (2000, Theorem 2.2) proves convergence of Equation 111 almost surely to given the following four conditions hold:
- I. is Lipschitz in ,
- II. Stepsizes satisfy Assumption 3,
- III. The sequence is a Martingale difference sequence with respect to the increasing family of σ-algebras: where and for some positive .
- IV. The sequence of iterates remains bounded, that is almost surely.
Conditions I and II hold trivially.
For Condition III, we can take expectations of the Martingale difference:
[TABLE]
as required. We now show that the variance is bounded using Assumption 2:
[TABLE]
thereby satisfying Condition III.
Finally, we prove Condition IV using Vidyasagar (2022, Theorem 5), which states iterates remain bounded almost surely if:
- (a) Conditions I and III hold;
- (b) there exists some Lyapunov function such that for constants and is bounded, and;
- (c) for all .
We propose as a candidate Lyapunov function, which trivially satisfies (b). We now show (c) holds by applying the fundamental theorem of calculus to . Let . As in Theorem 1, it follows:
[TABLE]
hence:
[TABLE]
for all under Assumption 4, as required. ∎
B.3 Stabilising FPE
Proposition 1.
Using the regularised TD vector in Equation 26, the path-mean Jacobians are:
[TABLE]
Assumption 5 is satisfied if:
[TABLE]
There exists finite such that Equation 130 holds.
Proof.
Taking derivatives of :
[TABLE]
For clarity, we drop arguments of and from our notation.
[TABLE]
We note that can always be made non-singular (and hence invertible) through an arbitrarily small change in , allowing us to multiply the first term by , yielding:
[TABLE]
We observe that:
[TABLE]
hence taking limits yields:
[TABLE]
From the continuity of the norm, it follows:
[TABLE]
implying that for any , , which it suffices to assume from here on. From the definition of the limit, there exists some finite such that
[TABLE]
for all for some small , and hence
[TABLE]
for all , as required. ∎
B.4 Nonasymptotic Analysis
Lemma 3.
Under Assumption 2, for the expected updates can be factored as:
[TABLE]
and for :
[TABLE]
Proof.
By the definition of the expected update :
[TABLE]
As in Theorem 1, let define the line connecting to . Using this notation we re-write the expected update as:
[TABLE]
Applying the fundamental theorem of calculus under Assumption 2 and the chain rule yields our desired result:
[TABLE]
Our second result follows immediately:
[TABLE]
For our final result:
[TABLE]
By the definition of the expected update:
[TABLE]
Let define the line connecting to . Using this notation we re-write the expected update as:
[TABLE]
Applying the fundamental theorem of calculus under Assumption 2 and the chain rule yields our desired result:
[TABLE]
∎
Lemma 4.
Under Assumption 2,
[TABLE]
Proof.
We start by bounding the expected norm term using Jensen’s inequality:
[TABLE]
where we applied the triangle inequality to derive the final line. We bound the variance term by substituting :
[TABLE]
Applying Lemma 3 to the expectation and using the triangle inequality yields our desired result:
[TABLE]
∎
Theorem 3.
Define
[TABLE]
Let Assumptions 1 and 2 hold, then:
[TABLE]
Proof.
Let denote the intermediate function approximation parameters between target parameter updates and , with and . We define the set of samples up to as: with distribution , with sample having distribution . Under this notation, we must show:
[TABLE]
Applying Lemma 4 to the inner expectation:
[TABLE]
Applying Equation 178 from Lemma 4 to the inner expectation and applying Lemma 3 yields:
[TABLE]
Recursively applying Equation 195 to Equation 189 times yields:
[TABLE]
Now, applying Equation 178 and Lemma 3 to the expectation:
[TABLE]
Substituting into Equation 195:
[TABLE]
Finally, we apply Theorem 1 to yield our desired result:
[TABLE]
∎
Corollary 3.1.
Let Assumptions 1, 2, 5 and 6 hold. For a fixed stepsize ,
[TABLE]
Proof.
We start by applying Theorem 3:
[TABLE]
As is a region of contraction and for all , there exists a positive under Assumption 6 such that , hence:
[TABLE]
Now, for a fixed constant stepsize , we can apply Equation 211 times, yielding:
[TABLE]
Now we apply the bound , yielding our desired result:
[TABLE]
∎
B.5 Breaking the Deadly Triad
Theorem 4.
Let Assumption 7 hold over from Definition 1. For any such that , any
[TABLE]
ensures that is a region of contraction satisfying Assumption 6.
Proof.
Now, as is a symmetric function of with a minimum at and is the midpoint of and , it follows:
[TABLE]
Now,
[TABLE]
hence
[TABLE]
Let where . From the definition of a limit, this implies that for there exists some finite such that whenever :
[TABLE]
as required. To find the value of for which , we set and solve:
[TABLE]
∎
Appendix C Additional Experiment Information
For both plots, each configuration was run over random seeds, with the central tendency given by the mean, and the shaded errors representing the standard error of the mean. Hyperparameters that are not varied in the plots were optimised by grid search across either linear or logarithmic hyperparameter ranges, as is suitable. Parameters were chosen that led to the highest performance as averaged across random seeds, then relevant hyperparameters were varied, using the optimal fixed hyperparameters. Hyperparameters that were varied are denoted as lists in the tables below.
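The plotted summary statistics can be computed as in the following minimal sketch. It assumes the standard error uses the sample standard deviation (denominator n - 1), which is not stated explicitly here; the function name `mean_and_sem` is illustrative.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and standard error of the mean across seeds.

    values : one scalar result per random seed (at least two seeds).
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem
```

For a learning curve, the function would be applied independently at each evaluation point, with the shaded band drawn at the mean plus and minus the returned standard error.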
C.1 Baird’s Counterexample
Figure 4 shows the counterexample. The behaviour policy chooses between the action represented by the wavy line with probability 6/7, and the solid line with probability 1/7. The target policy always chooses the solid line. The linear function approximation scheme is shown in terms of the value function weights. Sampling off policy in this way leads to divergence of TD, but PFPE converges, as seen in Figure 2.
C.2 Cartpole Experiment
For the Cartpole experiment, we use a simple DQN-style setup with a small multilayer perceptron (MLP) representing the value function. A small adjustment is made from PFPE as characterised by the paper. Instead of updating value parameters on single data points, parameter updates are averaged across a small batch. This was found to increase stability of learning in both settings, with no notable effects when comparing across independent variables. This means that, in addition to our target network, we also make use of a replay buffer which stores observed transitions. As such, data used in updates was sampled uniformly from previous transitions. The policy was ε-greedy, with the estimated optimal action taken with probability . The environment is maintained by OpenAI as part of the gym suite, and falls under MIT licensing.
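The two data-collection components described above can be sketched as follows. This is a generic illustration using only the Python standard library, not the code used for the experiments; the names `epsilon_greedy` and `ReplayBuffer`, and the convention that the random branch picks uniformly over all actions, are assumptions of the sketch.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform sampling."""

    def __init__(self, capacity):
        # The oldest transitions are discarded once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from stored transitions.
        return rng.sample(list(self.storage), batch_size)
```

A batch drawn with `sample` would then be used to average the value-parameter update, as described in the paragraph above.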
Appendix D Extensions
As discussed in Section 4, once we can establish Assumption 4 then there are several theoretical tools that become applicable from stochastic approximation to prove convergence under a range of assumptions. Brooms (2006) provide a comprehensive overview of classic methods. In particular, stochastic approximation has been shown to converge when sampling from an ergodic Markov chain under specific regularity assumptions (Allasonniere et al., 2010). Perhaps the easiest to verify in our context are those of Andrieu et al. (2005), who provide a series of assumptions that can be checked in practice. Moreover, this theory was recently extended to Markov chains that converge sub-geometrically to their stationary distributions by Debavelaere et al. (2021). Adherence of the updates to remain in a contractive region can be ensured by projection into an ever increasing subset of until convergence occurs, which is detailed and analysed in Andradottir (1991).
References
- Allasonniere et al. (2010) Allasonniere, S., Kuhn, E., and Trouve, A. Construction of bayesian deformable models via a stochastic approximation algorithm: A convergence study. Bernoulli, 16(3):641–678, 2010. ISSN 13507265. URL http://www.jstor.org/stable/25735007.
- Andradottir (1991) Andradottir, S. A projected stochastic approximation algorithm. In 1991 Winter Simulation Conference Proceedings, pp. 954–957, 1991. doi: 10.1109/WSC.1991.185710.
- Andrieu et al. (2005) Andrieu, C., Moulines, E., and Priouret, P. Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44(1):283–312, 2005. doi: 10.1137/S0363012902417267. URL https://doi.org/10.1137/S0363012902417267.
- Baird (1995a) Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A. and Russell, S. J. (eds.), Proceedings of the Twelfth International Conference on Machine Learning (ICML 1995), pp. 30–37, San Francisco, CA, USA, 1995a. Morgan Kauffman. ISBN 1-55860-377-8. URL http://leemon.com/papers/1995b.pdf.
- Baird (1995b) Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37. Elsevier, 1995b.
- Bertsekas & Ioffe (1996) Bertsekas, D. P. and Ioffe, S. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Lab. for Info. and Decision Systems Report LIDS-P-2349, MIT, Cambridge, MA, 14, 1996.
- Bertsekas et al. (2004) Bertsekas, D. P., Borkar, V. S., and Nedic, A. Improved temporal difference methods with linear function approximation. Learning and Approximate Dynamic Programming, pp. 231–255, 2004.
- Bhandari et al. (2018) Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation, 2018.
