Q($\lambda$) with Off-Policy Corrections

Anna Harutyunyan; Marc G. Bellemare; Tom Stepleton; Rémi Munos

arXiv:1602.04951·cs.AI·August 12, 2016·23 cites

Open Access

TL;DR

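This paper introduces an alternative off-policy TD($\lambda$) method that corrects off-policy returns in terms of rewards using the current Q-function, rather than in terms of transition probabilities using the target policy. It provides convergence guarantees under stated conditions and an empirical illustration on a continuous-state control task.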

Contribution

It proposes an off-policy correction that applies the current Q-function to the rewards of multi-step returns, with a convergence analysis for both policy evaluation and control and an empirical illustration.

Findings

01

Off-policy convergence in both policy evaluation and control, under conditions on the policies and parameters

02

The convergence conditions relate the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor

03

Empirical illustration of this relationship on a continuous-state control task

Abstract

We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter and the discount factor, and formalize an underlying tradeoff in off-policy TD($\lambda$). We illustrate this theoretical relationship empirically on a continuous-state control task.
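The reward-level correction described in the abstract can be sketched in a tabular setting. This is a minimal illustration, not the authors' implementation. It assumes that each step's correction is an expected-SARSA-style TD error under the target policy, and that these errors are summed with weights $(\gamma\lambda)^k$ along a trajectory generated by the behavior policy, with no importance-sampling ratios. The function name, argument names, and trajectory layout are hypothetical.

```python
import numpy as np


def corrected_lambda_increments(q, trajectory, target_policy, gamma, lam):
    """Per-step increments for Q(x_t, a_t) along one behavior trajectory.

    Assumed (hypothetical) layout:
      q             : array [n_states, n_actions], current Q-function
      trajectory    : list of (state, action, reward, next_state, done)
      target_policy : array [n_states, n_actions], each row sums to 1
    """
    # TD error at each step: the reward is corrected with the current
    # Q-function, using the target policy's expectation at the next state.
    deltas = []
    for x, a, r, x_next, done in trajectory:
        if done:
            expected_next = 0.0
        else:
            expected_next = float(np.dot(target_policy[x_next], q[x_next]))
        deltas.append(r + gamma * expected_next - q[x, a])

    # Backward pass: increment_t = delta_t + gamma * lam * increment_{t+1}.
    # No importance weights appear; the trace decays by gamma * lam only.
    increments = np.zeros(len(trajectory))
    running = 0.0
    for t in reversed(range(len(trajectory))):
        running = deltas[t] + gamma * lam * running
        increments[t] = running
    return increments
```

A caller would apply `q[x_t, a_t] += alpha * increments[t]` for a step size `alpha` of its choosing. With `lam = 0` each increment reduces to the one-step TD error, while larger `lam` propagates later corrections further back, which is the regime where the abstract's conditions on policy distance, trace parameter, and discount factor matter.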

Taxonomy

Topics: Cryptography and Data Security · Complexity and Algorithms in Graphs · Formal Methods in Verification

Methods: Eligibility Trace