GQ($\lambda$) Quick Reference and Implementation Guide

Adam White; Richard S. Sutton

arXiv:1705.03967·cs.LG·May 12, 2017

TL;DR

This paper provides a quick reference and implementation guide for the linear GQ(λ) off-policy temporal-difference learning algorithm, including theoretical background and Java code for practical use.

Contribution

It offers a concise reference and implementation resources for GQ(λ), facilitating understanding and application of this gradient-based off-policy learning algorithm.

Findings

1. Provides a clear implementation guide for GQ(λ)

2. Includes Java code for practical use

3. Summarizes key theoretical aspects

Abstract

This document should serve as a quick reference for and guide to the implementation of linear GQ($\lambda$), a gradient-based off-policy temporal-difference learning algorithm. Explanations of the intuition and theory behind the algorithm are provided elsewhere (e.g., Maei & Sutton 2010, Maei 2011). If you have questions or concerns about the content in this document or the attached Java code, please email Adam White ([email protected]). The code is provided as part of the source files in the arXiv submission.

Equations

$$\rho_{t}=\frac{\pi(S_{t},A_{t})}{b(S_{t},A_{t})}$$

$$\bar{\bm{\phi}}_{t}=\sum_{a\in\mathcal{A}}\pi(S_{t},a)\bm{\phi}(S_{t},a)$$

$$\delta_{t}=r(S_{t},A_{t},S_{t+1})+\gamma(S_{t+1})\bm{\theta}_{t}^{\top}\bar{\bm{\phi}}_{t+1}-\bm{\theta}_{t}^{\top}\bm{\phi}(S_{t},A_{t})$$

$$\bm{\theta}_{t+1}=\bm{\theta}_{t}+\alpha\left[\delta_{t}{\bm{e}}_{t}-\gamma(S_{t+1})(1-\lambda(S_{t+1}))({\bm{w}}_{t}^{\top}{\bm{e}}_{t})\bar{\bm{\phi}}_{t+1}\right]$$

$${\bm{w}}_{t+1}={\bm{w}}_{t}+\alpha\eta\left[\delta_{t}{\bm{e}}_{t}-({\bm{w}}_{t}^{\top}\bm{\phi}(S_{t},A_{t}))\bm{\phi}(S_{t},A_{t})\right]$$

$${\bm{e}}_{t}=I(S_{t})\bm{\phi}(S_{t},A_{t})+\gamma(S_{t})\lambda(S_{t})\rho_{t}{\bm{e}}_{t-1}$$

Taxonomy

Topics: Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control · Advanced Bandit Algorithms Research

Methods: Adam

Full text

GQ($\lambda$) Quick Reference and Implementation Guide

Adam White and Richard S. Sutton

(Revised July 29, 2014)

This document should serve as a quick reference for and guide to the implementation of linear GQ($\lambda$), a gradient-based off-policy temporal-difference learning algorithm. Explanations of the intuition and theory behind the algorithm are provided elsewhere (e.g., Maei & Sutton 2010, Maei 2011). If you have questions or concerns about the content in this document or the attached Java code, please email Adam White ([email protected]).

1 Requirements and Setting

For each use of GQ($\lambda$) you will need to provide three question functions specifying the quantity to be predicted, and four answer functions characterizing the approximation that will be found. Let $\mathcal{S}$ and $\mathcal{A}$ denote the sets of states and actions. Then the question functions are:

• $\pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$; target policy to be learned. Incidentally, if $\pi$ is chosen as the greedy policy with respect to the learned value function, then the algorithm will implement a generalization of the Greedy-GQ algorithm (Maei, Szepesvári, Bhatnagar & Sutton 2010).

• $\gamma:\mathcal{S}\rightarrow[0,1]$; termination or discounting function ($\gamma(s)=1-\beta(s)$ in the GQ paper)

• $r:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$; reward function

Many publications also specify a fourth question function, the terminal reward function $z:\mathcal{S}\rightarrow\mathbb{R}$, which gives a final reward at termination. More recently it has been recognized that this functionality can be included in the reward function by making use of the discounting function (Modayil, White & Sutton 2014). For example, if one wanted only a terminal reward $z(s)$ upon termination in state $s$, one would use the reward function $r(s,a,s^{\prime})=(1-\gamma(s^{\prime}))z(s^{\prime})$. This completes the specification of the predictive question that you are seeking to answer using the GQ($\lambda$) algorithm.
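The folding of a terminal reward into the reward function can be illustrated with a minimal Python sketch. The three-state layout, the particular values returned by `gamma` and `z`, and the function names are assumptions made for illustration only; they do not come from the attached code.

```python
def gamma(s):
    """Hypothetical discounting function: state 2 terminates, state 1 half-terminates."""
    if s == 2:
        return 0.0
    if s == 1:
        return 0.5
    return 1.0


def z(s):
    """Hypothetical terminal reward function (assumed values)."""
    return {0: 0.0, 1: 4.0, 2: 10.0}[s]


def r(s, a, s_next):
    """Reward that pays z(s_next) in proportion to the termination 1 - gamma(s_next)."""
    return (1.0 - gamma(s_next)) * z(s_next)


print(r(0, 0, 2))  # full termination in state 2: the whole terminal reward
print(r(0, 0, 1))  # partial termination in state 1: half of z(1)
print(r(1, 0, 0))  # no termination in state 0: nothing is paid
```

With this construction no separate terminal-reward argument needs to be passed to the learner; the reward and discounting functions carry the same information.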

The answer functions are:

• $b:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$; behavior policy

• $I:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$; interest function (can be set to 1 for all state-action pairs, or used to indicate selected state-action pairs to be best approximated)

• $\bm{\phi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{n}$; feature-vector function

• $\lambda:\mathcal{S}\rightarrow[0,1]$; bootstrapping or eligibility-trace decay-rate function
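As a rough illustration of what the four answer functions might look like in practice, here is a small Python sketch. Everything in it is an assumption chosen for the example: two actions, a uniform behavior policy, constant interest and $\lambda$, and a feature vector built by copying hypothetical state features into a per-action block (one common way to obtain state-action features from state features).

```python
N_ACTIONS = 2  # assumed number of actions


def state_features(s):
    """Hypothetical state features: a bias term plus the state index."""
    return [1.0, float(s)]


def phi(s, a):
    """Feature vector for (s, a): state features copied into the block for action a."""
    x = state_features(s)
    vec = [0.0] * (len(x) * N_ACTIONS)
    vec[a * len(x):(a + 1) * len(x)] = x
    return vec


def b(s, a):
    """Behavior policy: uniform over the actions."""
    return 1.0 / N_ACTIONS


def interest(s, a):
    """Interest function: every state-action pair matters equally."""
    return 1.0


def lam(s):
    """Bootstrapping function: a constant trace-decay rate (assumed value)."""
    return 0.9


print(phi(3, 0))
print(phi(3, 1))
```

Here $n=4$; features belonging to the action that was not taken are zero, so each action effectively gets its own set of weights.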

The following data structures are internal to GQ:

• $\bm{\theta}\in\mathbb{R}^{n}$; the learned weights of the linear approximation: $Q^{\pi}(s,a)\approx\bm{\theta}^{\top}\bm{\phi}(s,a)=\sum_{i=1}^{n}\bm{\theta}_{i}\bm{\phi}_{i}(s,a)$

• ${\bm{w}}\in\mathbb{R}^{n}$; secondary set of learned weights

• ${\bm{e}}\in\mathbb{R}^{n}$; eligibility trace vector

Parameters internal to GQ:

• $\alpha$; step-size parameter for learning $\bm{\theta}$

• $\eta\in[0,1]$; relative step-size parameter for learning ${\bm{w}}$ (the step size used for ${\bm{w}}$ is $\alpha\eta$)

2 Algorithm Specification

We can now specify GQ($\lambda$). Let ${\bm{w}}$ and ${\bm{e}}$ be initialized to zero and $\bm{\theta}$ be initialized arbitrarily. Let the subscript $t$ denote the current time step. Let $\rho_{t}$ denote the “importance sampling” ratio:

$$\rho_{t}=\frac{\pi(S_{t},A_{t})}{b(S_{t},A_{t})},$$

where $S_{t}$ and $A_{t}$ are the state and action occurring on time step $t$. Let $\bar{\bm{\phi}}_{t}$ denote the expected next feature vector, defined by:

$$\bar{\bm{\phi}}_{t}=\sum_{a\in\mathcal{A}}\pi(S_{t},a)\bm{\phi}(S_{t},a)$$
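The two quantities just defined can be computed as in the following Python sketch. The helper names and the toy policies and features at the bottom (`target`, `behavior`, `one_hot`) are hypothetical and exist only to exercise the helpers; the sketch also assumes $b(s,a)>0$ for every action the behavior policy can select.

```python
def importance_ratio(pi, b, s, a):
    """rho = pi(s, a) / b(s, a); assumes b(s, a) > 0 for the action taken."""
    return pi(s, a) / b(s, a)


def expected_features(pi, phi, s, actions):
    """phi_bar = sum over actions a of pi(s, a) * phi(s, a)."""
    phi_bar = None
    for a in actions:
        features = phi(s, a)
        if phi_bar is None:
            phi_bar = [0.0] * len(features)
        weight = pi(s, a)
        for i, x in enumerate(features):
            phi_bar[i] += weight * x
    return phi_bar


# Hypothetical toy setup: two states, two actions, one-hot features per pair.
def target(s, a):
    return 1.0 if a == 0 else 0.0  # target policy always picks action 0


def behavior(s, a):
    return 0.5  # behavior policy is uniform over the two actions


def one_hot(s, a):
    vec = [0.0] * 4
    vec[2 * s + a] = 1.0
    return vec


print(importance_ratio(target, behavior, 0, 0))  # action the target policy would take
print(importance_ratio(target, behavior, 0, 1))  # action the target policy never takes
print(expected_features(target, one_hot, 1, (0, 1)))
```

Because the toy target policy is deterministic, the expected feature vector in state 1 is simply the feature vector of the pair (state 1, action 0).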

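Then the following equations fully specify GQ($\lambda$):

$$\delta_{t}=r(S_{t},A_{t},S_{t+1})+\gamma(S_{t+1})\bm{\theta}_{t}^{\top}\bar{\bm{\phi}}_{t+1}-\bm{\theta}_{t}^{\top}\bm{\phi}(S_{t},A_{t})$$

$$\bm{\theta}_{t+1}=\bm{\theta}_{t}+\alpha\left[\delta_{t}{\bm{e}}_{t}-\gamma(S_{t+1})(1-\lambda(S_{t+1}))({\bm{w}}_{t}^{\top}{\bm{e}}_{t})\bar{\bm{\phi}}_{t+1}\right]$$

$${\bm{w}}_{t+1}={\bm{w}}_{t}+\alpha\eta\left[\delta_{t}{\bm{e}}_{t}-({\bm{w}}_{t}^{\top}\bm{\phi}(S_{t},A_{t}))\bm{\phi}(S_{t},A_{t})\right]$$

$${\bm{e}}_{t}=I(S_{t})\bm{\phi}(S_{t},A_{t})+\gamma(S_{t})\lambda(S_{t})\rho_{t}{\bm{e}}_{t-1}$$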
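A single application of these updates can be traced by hand. The Python sketch below is a minimal, unoptimized rendering of one update step, not the attached implementation; the function name `gq_learn`, its argument order, and the numbers in the example call are assumptions. The trace passed in is taken to have already been decayed by $\gamma\lambda$ at the end of the previous step, so the function first adds the new features, then updates the weights, then decays the trace for the next step.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def gq_learn(theta, w, e, phi, phi_bar, lam, gamma, reward, rho, interest, alpha, eta):
    """One GQ(lambda) step; lam and gamma are evaluated at the next state."""
    # TD error using the expected next features under the target policy.
    delta = reward + gamma * dot(theta, phi_bar) - dot(theta, phi)
    # Trace update: importance-weight the carried-over trace, add current features.
    e = [rho * ei + interest * fi for ei, fi in zip(e, phi)]
    w_e = dot(w, e)
    w_phi = dot(w, phi)
    # Main weights: TD update minus the gradient-correction term.
    theta = [t + alpha * (delta * ei - gamma * (1.0 - lam) * w_e * fb)
             for t, ei, fb in zip(theta, e, phi_bar)]
    # Secondary weights, learned with step size alpha * eta.
    w = [wi + alpha * eta * (delta * ei - w_phi * fi)
         for wi, ei, fi in zip(w, e, phi)]
    # Decay the trace so it is ready for the next step.
    e = [gamma * lam * ei for ei in e]
    return theta, w, e


theta, w, e = gq_learn(
    theta=[0.0, 0.0], w=[0.0, 0.0], e=[0.0, 0.0],
    phi=[1.0, 0.0], phi_bar=[0.0, 1.0],
    lam=0.5, gamma=1.0, reward=1.0, rho=1.0, interest=1.0,
    alpha=0.5, eta=1.0,
)
print(theta, w, e)  # [0.5, 0.0] [0.5, 0.0] [0.5, 0.0]
```

In this call $\delta=1$ and ${\bm{w}}=0$, so the correction term vanishes and both weight vectors move by $\alpha\delta{\bm{e}}$ (with $\eta=1$); the returned trace is the updated trace scaled by $\gamma\lambda=0.5$.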

3 Pseudocode

The following pseudocode characterizes the algorithm and its use.

Initialize $\bm{\theta}$ arbitrarily and ${\bm{w}}=0$
Repeat (for each episode):
    Initialize ${\bm{e}}=0$
    $S\leftarrow$ initial state of episode
    Repeat (for each step of episode):
        $A\leftarrow$ action selected by policy $b$ in state $S$
        Take action $A$, observe next state $S^{\prime}$
        $\bar{\bm{\phi}}\leftarrow 0$
        For all $a\in\mathcal{A}(S^{\prime})$:
            $\bar{\bm{\phi}}\leftarrow\bar{\bm{\phi}}+\pi(S^{\prime},a)\bm{\phi}(S^{\prime},a)$
        $\rho\leftarrow\frac{\pi(S,A)}{b(S,A)}$
        GQlearn($\bm{\phi}(S,A),\bar{\bm{\phi}},\lambda(S^{\prime}),\gamma(S^{\prime}),r(S,A,S^{\prime}),\rho,I(S)$)
        $S\leftarrow S^{\prime}$
    until $S^{\prime}$ is terminal

GQlearn($\bm{\phi},\bar{\bm{\phi}},\lambda,\gamma,R,\rho,I$):
    $\delta\leftarrow R+\gamma\bm{\theta}^{\top}\bar{\bm{\phi}}-\bm{\theta}^{\top}\bm{\phi}$
    ${\bm{e}}\leftarrow\rho{\bm{e}}+I\bm{\phi}$
    $\bm{\theta}\leftarrow\bm{\theta}+\alpha(\delta{\bm{e}}-\gamma(1-\lambda)({\bm{w}}^{\top}{\bm{e}})\bar{\bm{\phi}})$
    ${\bm{w}}\leftarrow{\bm{w}}+\alpha\eta(\delta{\bm{e}}-({\bm{w}}^{\top}\bm{\phi})\bm{\phi})$
    ${\bm{e}}\leftarrow\gamma\lambda{\bm{e}}$
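To show how the episode loop and the update fit together, here is a self-contained Python sketch on a toy problem. The environment is entirely hypothetical: a chain with non-terminal states 0 and 1 and terminal state 2, actions "right" (advance) and "stay", a reward of 1 on reaching the terminal state, a target policy that always moves right, a uniform behavior policy, and one-hot features. The step sizes, $\gamma=0.5$, $\lambda=0.5$, and the seed are likewise assumptions. For this problem the true action values under the target policy can be worked out by hand as $Q^{\pi}=(0.5,\,0.25,\,1.0,\,0.5)$ for the pairs (0, right), (0, stay), (1, right), (1, stay).

```python
import random

ACTIONS = (0, 1)  # hypothetical toy chain: 0 = move right, 1 = stay
TERMINAL = 2      # states 0 and 1 are non-terminal
N = 4             # one feature per non-terminal (state, action) pair


def pi(s, a):
    return 1.0 if a == 0 else 0.0  # target policy: always move right


def b(s, a):
    return 0.5  # behavior policy: uniform random


def gamma(s):
    return 0.0 if s == TERMINAL else 0.5


def lam(s):
    return 0.5


def interest(s):
    return 1.0


def reward(s, a, s_next):
    return 1.0 if s_next == TERMINAL else 0.0


def phi(s, a):
    vec = [0.0] * N
    if s != TERMINAL:
        vec[2 * s + a] = 1.0  # terminal state keeps the all-zero vector
    return vec


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


class GQ:
    def __init__(self, n, alpha, eta):
        self.theta, self.w, self.e = [0.0] * n, [0.0] * n, [0.0] * n
        self.alpha, self.eta = alpha, eta

    def learn(self, f, f_bar, lam_, gamma_, r, rho, i):
        """Mirrors the GQlearn pseudocode line by line."""
        delta = r + gamma_ * dot(self.theta, f_bar) - dot(self.theta, f)
        self.e = [rho * e + i * x for e, x in zip(self.e, f)]
        we, wf = dot(self.w, self.e), dot(self.w, f)
        self.theta = [t + self.alpha * (delta * e - gamma_ * (1 - lam_) * we * xb)
                      for t, e, xb in zip(self.theta, self.e, f_bar)]
        self.w = [w + self.alpha * self.eta * (delta * e - wf * x)
                  for w, e, x in zip(self.w, self.e, f)]
        self.e = [gamma_ * lam_ * e for e in self.e]


def run(episodes, alpha=0.05, eta=0.1, seed=0):
    rng = random.Random(seed)
    agent = GQ(N, alpha, eta)
    for _ in range(episodes):
        agent.e = [0.0] * N  # reset the trace at the start of each episode
        s = 0
        while s != TERMINAL:
            a = 0 if rng.random() < b(s, 0) else 1  # sample from b
            s_next = s + 1 if a == 0 else s
            f_bar = [0.0] * N
            for a2 in ACTIONS:  # expected next features under pi
                for k, x in enumerate(phi(s_next, a2)):
                    f_bar[k] += pi(s_next, a2) * x
            rho = pi(s, a) / b(s, a)
            agent.learn(phi(s, a), f_bar, lam(s_next), gamma(s_next),
                        reward(s, a, s_next), rho, interest(s))
            s = s_next
    return agent.theta


theta = run(5000)
print([round(t, 3) for t in theta])
```

With one-hot features the approximation can represent $Q^{\pi}$ exactly, so the learned weights should approach the hand-computed values even though the data are generated by the uniform behavior policy.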

4 Code

The files GQlambda.java and GQlambda.cpp (in the arXiv source archive) contain implementations of the GQlearn function described in the pseudocode. We have excluded optimizations (e.g., binary features or efficient trace implementation) to ensure the code is simple and easy to understand. We leave it to the reader to provide environment code for interfacing to GQ($\lambda$) (e.g., using RL-Glue).

5 References

Maei, H. R., Szepesvári, Cs., Bhatnagar, S., Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel.

Maei, H. R., Sutton, R. S. (2010). GQ($\lambda$): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, pp. 91–96.

Modayil, J., White, A., Sutton, R. S. (2014). Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior 22(2):146–160.

Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
