Highway Reinforcement Learning

Yuhui Wang; Miroslav Strupl; Francesco Faccio; Qingyuan Wu; Haozhe Liu; Michał Grudzień; Xiaoyang Tan; Jürgen Schmidhuber

arXiv:2405.18289·cs.LG·May 29, 2024



TL;DR

This paper introduces an importance-sampling-free, multi-step off-policy reinforcement learning method whose highway gate lets it use information from distant future time steps. It avoids the underestimation of the optimal value function seen in standard $n$-step methods and improves performance on delayed-reward tasks.

Contribution

A new multi-step off-policy RL algorithm that needs no importance sampling (IS) and uses a highway gate to guarantee convergence to the optimal value function regardless of lookahead depth.

Findings

01. Outperforms existing multi-step off-policy algorithms on delayed-reward tasks

02. Guarantees convergence to the optimal value function for any lookahead depth

03. Effectively utilizes distant future information through the highway gate mechanism

Abstract

Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of $n$. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple…
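
The underestimation described in the abstract can be illustrated with a minimal sketch of the uncorrected $n$-step Q-learning target. This is not the paper's highway method, whose details are cut off above. The one-state MDP, the names `n_step_q_target` and `target_after_bad_actions`, and the choice of $\gamma = 0.9$ are all assumptions made for illustration. The sketch bootstraps from the exact optimal Q-values, so any gap to the optimal value comes only from the suboptimal actions in the behavior trajectory.

```python
from typing import Sequence


def n_step_q_target(rewards: Sequence[float],
                    bootstrap_q: Sequence[float],
                    gamma: float) -> float:
    """Uncorrected n-step Q-learning target.

    Sums the discounted rewards observed along the behavior trajectory,
    then bootstraps greedily from the Q-values of the state reached
    after n = len(rewards) steps. No importance-sampling correction.
    """
    n = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** n * max(bootstrap_q)


# Hypothetical one-state MDP (an assumption for illustration):
# action "good" pays reward 1, action "bad" pays 0, the state never changes.
GAMMA = 0.9
V_STAR = 1.0 / (1.0 - GAMMA)                      # optimal state value
Q_STAR = [GAMMA * V_STAR, 1.0 + GAMMA * V_STAR]   # [bad, good]


def target_after_bad_actions(n: int) -> float:
    """Target for (s, good) when the behavior policy takes "good" once
    and then "bad" for the remaining n - 1 steps."""
    rewards = [1.0] + [0.0] * (n - 1)
    return n_step_q_target(rewards, Q_STAR, GAMMA)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, target_after_bad_actions(n), "optimal:", Q_STAR[1])
```

With $n = 1$ the target matches the optimal value of the "good" action (about 10). With $n = 3$ it drops to about 8.29 and with $n = 5$ to about 6.90, even though the bootstrap values are exact: the target only ever sees the rewards the behavior policy actually collected, so it cannot exceed the optimal value and falls further below it as more suboptimal steps are included.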


Taxonomy

Topics: Traffic control and management

Methods: Sparse Evolutionary Training