GQ($\lambda$) Quick Reference and Implementation Guide
Adam White, Richard S. Sutton

TL;DR
This paper provides a quick reference and implementation guide for the linear GQ(λ) off-policy temporal-difference learning algorithm, including theoretical background and Java code for practical use.
Contribution
It offers a concise reference and implementation resources for GQ(λ), facilitating understanding and application of this gradient-based off-policy learning algorithm.
Findings
Provides a clear implementation guide for GQ(λ)
Includes Java code for practical use
Summarizes key theoretical aspects
Abstract
This document should serve as a quick reference for and guide to the implementation of linear GQ($\lambda$), a gradient-based off-policy temporal-difference learning algorithm. Explanations of the intuition and theory behind the algorithm are provided elsewhere (e.g., Maei & Sutton 2010, Maei 2011). If you have questions or concerns about the content in this document or the attached Java code, please email Adam White ([email protected]). The code is provided as part of the source files in the arXiv submission.
(Revised July 29, 2014)
1 Requirements and Setting
For each use of GQ($\lambda$) you will need to provide three question functions specifying the quantity to be predicted, and four answer functions characterizing the approximation that will be found. Let $\mathcal{S}$ and $\mathcal{A}$ denote the sets of states and actions. Then the question functions are:
- $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$; target policy to be learned. Incidentally, if $\pi$ is chosen as the greedy policy with respect to the learned value function, then the algorithm will implement a generalization of the Greedy-GQ algorithm (Maei, Szepesvári, Bhatnagar & Sutton 2010).
- $\gamma : \mathcal{S} \to [0, 1]$; termination or discounting function ($\gamma = 1 - \beta$ in the GQ paper)
- $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$; reward function
In many publications a fourth question function is also specified: the terminal reward function $z : \mathcal{S} \to \mathbb{R}$, used to specify a final reward at termination. More recently it has been recognized that this functionality can be included in the reward function by making use of the discounting function (Modayil, White & Sutton 2014). For example, if one wanted only a terminal reward $z(s)$ upon termination in state $s$, one would use a reward function of $r(s, a, s') = (1 - \gamma(s'))\,z(s')$. This completes the specification of the predictive question that you are seeking to answer using the GQ($\lambda$) algorithm.
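A minimal Java sketch of folding a terminal reward into the reward function is given here, assuming the reward takes the form $r(s, a, s') = (1 - \gamma(s'))\,z(s')$. The class name, the integer state ids, and the three-state chain are hypothetical and chosen only for illustration; they are not part of the attached code.

```java
import java.util.function.IntToDoubleFunction;

/** Sketch: a terminal reward z expressed through the reward function and gamma. */
final class TerminalRewardSketch {
    /**
     * r(s, a, s') = (1 - gamma(s')) * z(s').
     * States are hypothetical integer ids; s and a are unused in this special case.
     */
    static double reward(int s, int a, int sNext,
                         IntToDoubleFunction gamma, IntToDoubleFunction z) {
        return (1.0 - gamma.applyAsDouble(sNext)) * z.applyAsDouble(sNext);
    }

    public static void main(String[] args) {
        // Hypothetical chain: state 2 terminates (gamma = 0), states 0 and 1 continue (gamma = 1).
        IntToDoubleFunction gamma = s -> (s == 2) ? 0.0 : 1.0;
        IntToDoubleFunction z = s -> (s == 2) ? 5.0 : 0.0;
        System.out.println(reward(0, 0, 1, gamma, z)); // 0.0: no termination, no reward
        System.out.println(reward(1, 0, 2, gamma, z)); // 5.0: full terminal reward
    }
}
```

With a discounting function strictly between 0 and 1, the same expression yields a proportionally scaled terminal reward on every transition.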
The answer functions are:
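- $b : \mathcal{S} \times \mathcal{A} \to [0, 1]$; behavior policy
- $I : \mathcal{S} \times \mathcal{A} \to [0, 1]$; interest function (can be set to 1 for all state-action pairs, or used to indicate selected state-action pairs to be best approximated)
- $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n$; feature-vector function
- $\lambda : \mathcal{S} \to [0, 1]$; bootstrapping or eligibility-trace decay-rate function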
The following data structures are internal to GQ:
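- $\theta \in \mathbb{R}^n$; the learned weights of the linear approximation: $Q^{\pi}(s, a) \approx \theta^\top \phi(s, a) = \sum_{i=1}^{n} \theta_i\,\phi_i(s, a)$
- $w \in \mathbb{R}^n$; secondary set of learned weights
- $e \in \mathbb{R}^n$; eligibility trace vector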
Parameters internal to GQ:
- $\alpha$; step-size parameter for learning $\theta$
- $\eta$; relative step-size parameter for learning $w$
2 Algorithm Specification
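We can now specify GQ($\lambda$). Let $w$ and $e$ be initialized to zero and $\theta$ be initialized arbitrarily. Let the subscript $t$ denote the current time step. Let $\rho_t$ denote the “importance sampling” ratio:
$$\rho_t = \frac{\pi(S_t, A_t)}{b(S_t, A_t)},$$
where $S_t$ and $A_t$ are the state and action occurring on time step $t$. Let $\bar\phi_{t+1}$ denote the expected next feature vector, defined by:
$$\bar\phi_{t+1} = \sum_{a} \pi(S_{t+1}, a)\,\phi(S_{t+1}, a).$$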
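The two quantities just introduced can be sketched in Java as follows: the importance-sampling ratio is the target-policy probability of the taken action divided by its behavior-policy probability, and the expected next feature vector is the target-policy-weighted sum of the per-action feature vectors in the next state. The class and method names are hypothetical, and the sketch assumes a finite action set indexed from 0 with dense feature arrays.

```java
import java.util.Arrays;

/** Sketch: importance-sampling ratio and expected feature vector for one state. */
final class RhoAndPhiBarSketch {
    /** rho = pi(s, a) / b(s, a); assumes b gives the taken action nonzero probability. */
    static double rho(double piProb, double bProb) {
        if (bProb <= 0.0) throw new IllegalArgumentException("b(s, a) must be positive");
        return piProb / bProb;
    }

    /** phiBar = sum over a of pi(s, a) * phi(s, a); both arrays are indexed by action. */
    static double[] phiBar(double[] piProbs, double[][] features) {
        double[] out = new double[features[0].length];
        for (int a = 0; a < piProbs.length; a++) {
            for (int i = 0; i < out.length; i++) {
                out[i] += piProbs[a] * features[a][i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] pi = {0.75, 0.25};                 // hypothetical target-policy probabilities
        double[][] phi = {{1.0, 0.0}, {0.0, 1.0}};  // hypothetical one-hot features per action
        System.out.println(rho(pi[0], 0.5));                  // 1.5
        System.out.println(Arrays.toString(phiBar(pi, phi))); // [0.75, 0.25]
    }
}
```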
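Then the following equations fully specify GQ($\lambda$):
$$\delta_t = r(S_t, A_t, S_{t+1}) + \gamma(S_{t+1})\,\theta_t^\top \bar\phi_{t+1} - \theta_t^\top \phi(S_t, A_t)$$
$$e_t = I(S_t, A_t)\,\phi(S_t, A_t) + \gamma(S_t)\,\lambda(S_t)\,\rho_t\,e_{t-1}$$
$$\theta_{t+1} = \theta_t + \alpha\left[\delta_t e_t - \gamma(S_{t+1})\,(1 - \lambda(S_{t+1}))\,(w_t^\top e_t)\,\bar\phi_{t+1}\right]$$
$$w_{t+1} = w_t + \alpha\eta\left[\delta_t e_t - (w_t^\top \phi(S_t, A_t))\,\phi(S_t, A_t)\right]$$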
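As an illustration only, one update can be worked through by hand. The calculation assumes the GQ($\lambda$) updates in the form given by Maei & Sutton (2010): a TD error $\delta_t$, a trace $e_t$, and updates to $\theta$ and $w$ with step sizes $\alpha$ and $\alpha\eta$. Every number here is made up for the example, and $\phi_t$ abbreviates $\phi(S_t, A_t)$.

```latex
% Hypothetical values: n = 2, \alpha = 0.1, \eta = 0.5, e_{t-1} = (0, 0), \rho_t = 1, I(S_t, A_t) = 1,
% \theta_t = (0.5, 1), w_t = (0.5, 0), \phi_t = (1, 0), \bar\phi_{t+1} = (0, 1),
% r = 1, \gamma(S_{t+1}) = 0.9, \lambda(S_{t+1}) = 0.5
\begin{aligned}
\delta_t &= r + \gamma(S_{t+1})\,\theta_t^\top \bar\phi_{t+1} - \theta_t^\top \phi_t
          = 1 + 0.9 \cdot 1 - 0.5 = 1.4 \\
e_t &= I(S_t, A_t)\,\phi_t + \gamma(S_t)\lambda(S_t)\rho_t\,e_{t-1} = (1, 0) \\
\theta_{t+1} &= (0.5,\ 1) + 0.1\,\bigl[\,1.4\,(1, 0) - 0.9 \cdot 0.5 \cdot 0.5\,(0, 1)\,\bigr] = (0.64,\ 0.9775) \\
w_{t+1} &= (0.5,\ 0) + 0.1 \cdot 0.5\,\bigl[\,1.4\,(1, 0) - 0.5\,(1, 0)\,\bigr] = (0.545,\ 0)
\end{aligned}
```

The second component of $\theta$ changes even though its trace entry is zero; that change comes entirely from the correction term weighted by $w_t^\top e_t = 0.5$.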
3 Pseudocode
The following pseudocode characterizes the algorithm and its use.
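Initialize $\theta$ arbitrarily and $w = 0$
Repeat (for each episode):
    Initialize $e = 0$
    $S \leftarrow$ initial state of episode
    Repeat (for each step of episode):
        $A \leftarrow$ action selected by policy $b$ in state $S$
        Take action $A$, observe next state, $S'$
        $\bar\phi \leftarrow 0$
        For all $a \in \mathcal{A}(S')$:
            $\bar\phi \leftarrow \bar\phi + \pi(S', a)\,\phi(S', a)$
        $\rho \leftarrow \pi(S, A) / b(S, A)$
        GQlearn($\phi(S, A), \bar\phi, \lambda(S'), \gamma(S'), r(S, A, S'), \rho, I(S, A)$)
        $S \leftarrow S'$
    until $S'$ is terminal

GQlearn($\phi, \bar\phi, \lambda, \gamma, r, \rho, I$)
    $\delta \leftarrow r + \gamma\,\theta^\top \bar\phi - \theta^\top \phi$
    $e \leftarrow \rho\,e + I\,\phi$
    $\theta \leftarrow \theta + \alpha\,(\delta\,e - \gamma\,(1 - \lambda)\,(w^\top e)\,\bar\phi)$
    $w \leftarrow w + \alpha\eta\,(\delta\,e - (w^\top \phi)\,\phi)$
    $e \leftarrow \gamma\,\lambda\,e$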
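The learning step can be sketched in Java as below. This is a minimal dense-vector illustration written for this guide, not the authors' GQlambda.java; the class name, method names, and argument order are assumptions. It assumes the update has the form: TD error from the reward and the expected next feature vector, trace accumulation scaled by the importance-sampling ratio, a gradient-correction term in the update of the primary weights, and trace decay by the product of discount and bootstrapping values at the next state.

```java
import java.util.Arrays;

/** Sketch of linear GQ(lambda) with dense vectors (hypothetical names throughout). */
final class GQLambdaSketch {
    final double[] theta, w, e;   // primary weights, secondary weights, eligibility trace
    final double alpha, eta;      // step size for theta; relative step size for w

    GQLambdaSketch(int n, double alpha, double eta) {
        theta = new double[n];    // theta may start anywhere; zeros are used here
        w = new double[n];        // w starts at zero
        e = new double[n];        // e starts at zero
        this.alpha = alpha;
        this.eta = eta;
    }

    static double dot(double[] x, double[] y) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    /** Clears the trace; call at the start of each episode. */
    void startEpisode() { Arrays.fill(e, 0.0); }

    /**
     * One update. lambda and gamma are evaluated at the next state;
     * rho and interest are evaluated at the current state-action pair.
     */
    void learn(double[] phi, double[] phiBar, double lambda, double gamma,
               double r, double rho, double interest) {
        double delta = r + gamma * dot(theta, phiBar) - dot(theta, phi);
        for (int i = 0; i < e.length; i++) e[i] = rho * e[i] + interest * phi[i];
        double wDotE = dot(w, e);       // computed with the pre-update w
        double wDotPhi = dot(w, phi);   // computed with the pre-update w
        for (int i = 0; i < theta.length; i++) {
            theta[i] += alpha * (delta * e[i] - gamma * (1.0 - lambda) * wDotE * phiBar[i]);
            w[i] += alpha * eta * (delta * e[i] - wDotPhi * phi[i]);
            e[i] *= gamma * lambda;     // decay the trace for the next step
        }
    }

    public static void main(String[] args) {
        GQLambdaSketch gq = new GQLambdaSketch(2, 0.1, 0.5);
        gq.startEpisode();
        // Hypothetical single transition with made-up features and reward.
        gq.learn(new double[]{1.0, 0.0}, new double[]{0.0, 1.0}, 0.5, 0.9, 1.0, 1.0, 1.0);
        System.out.println(Arrays.toString(gq.theta));
    }
}
```

Both inner products with $w$ are taken before the loop so that the updates to $\theta$ and $w$ use the same pre-update weights. A caller would compute the feature vectors and the ratio for each transition and then invoke `learn` once per time step.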
4 Code
The files GQlambda.java and GQlambda.cpp (in the arXiv source archive) contain implementations of the GQlearn function described in the pseudocode. We have excluded optimizations (e.g., binary features or efficient trace implementation) to ensure the code is simple and easy to understand. We leave it to the reader to provide environment code for interfacing to GQ($\lambda$) (e.g., using RL-Glue).
5 References
Maei, H. R., Szepesvári, Cs., Bhatnagar, S., Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel.
Maei, H. R., Sutton, R. S. (2010). GQ($\lambda$): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, pp. 91–96.
Modayil, J., White, A., Sutton, R. S. (2014). Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior 22(2):146–160.
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
