Highway Reinforcement Learning
Yuhui Wang, Miroslav Strupl, Francesco Faccio, Qingyuan Wu, Haozhe Liu, Michał Grudzień, Xiaoyang Tan, Jürgen Schmidhuber

TL;DR
This paper introduces an off-policy reinforcement learning method with a highway gate that makes effective use of information from the distant future. It overcomes the underestimation issue of traditional n-step methods and improves performance on delayed-reward tasks.
Contribution
A new importance-sampling-free (IS-free), multi-step off-policy RL algorithm with a highway gate that guarantees convergence to the optimal value function for any lookahead depth.
Findings
Outperforms existing multi-step off-policy algorithms on delayed-reward tasks
Guarantees convergence to the optimal value function for any lookahead depth
Effectively utilizes distant future information through the highway gate mechanism
Abstract
Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as n-step Q-learning, look ahead for n time steps along the trajectory of actions (where n is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of n. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large n, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple…
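The underestimation claim can be checked on a toy problem. The sketch below rests on assumptions of our own, not on the authors' experimental setup: a deterministic chain of HORIZON states where action "good" pays 1 and action "bad" pays 0, a behaviour policy that takes "good" at state 0 (the action being evaluated) and "bad" afterwards, and bootstrapping from the exact optimal values. The function n_step_target computes the IS-free n-step target. The function gated_target is a hypothetical max-over-depths gate, included only to show why keeping the one-step target as a lower bound removes the bias; the abstract is cut off before the highway gate is described, so it should not be read as the paper's operator.

```python
# Toy check of the n-step underestimation argument (hypothetical setup).
# Assumptions: deterministic chain of HORIZON steps; action "good" pays 1 and
# action "bad" pays 0, both advance one state; the behaviour policy takes
# "good" at state 0 and "bad" afterwards; targets bootstrap from exact V*.

GAMMA = 0.9
HORIZON = 10  # the episode terminates at state HORIZON


def optimal_value(state: int) -> float:
    """V*(s): the greedy policy earns reward 1 on every remaining step."""
    steps_left = max(HORIZON - state, 0)
    return sum(GAMMA ** k for k in range(steps_left))


def behaviour_rewards() -> list[float]:
    """Rewards observed along the behaviour trajectory starting at state 0."""
    return [1.0] + [0.0] * (HORIZON - 1)


def n_step_target(rewards: list[float], bootstrap: float, n: int) -> float:
    """IS-free n-step Q-learning target: discounted sum of the first n
    behaviour rewards plus gamma^n * max_a Q(s_n, a), with no IS correction."""
    partial_return = sum(GAMMA ** k * r for k, r in enumerate(rewards[:n]))
    return partial_return + GAMMA ** n * bootstrap


def gated_target(rewards: list[float], max_depth: int) -> float:
    """Hypothetical gate (illustration only): the largest n-step target over
    depths 1..max_depth. Depth 1 is the one-step Q-learning target, so the
    result never drops below it, whatever max_depth is."""
    return max(
        n_step_target(rewards, optimal_value(n), n)
        for n in range(1, max_depth + 1)
    )


if __name__ == "__main__":
    rewards = behaviour_rewards()
    q_star = 1.0 + GAMMA * optimal_value(1)  # Q*(0, good)
    for n in (1, 3, 5, HORIZON):
        target = n_step_target(rewards, optimal_value(n), n)
        gated = gated_target(rewards, n)
        print(f"n={n:2d}  n-step={target:.3f}  gated={gated:.3f}  Q*={q_star:.3f}")
```

In this toy chain the n-step target equals Q* only at n = 1 and falls to 1.0 at n = HORIZON, while the gated value stays at Q* for every depth. The gap, the sum of gamma^k for k = 1..n-1, is exactly the discounted reward the suboptimal behaviour policy forgoes, which is why the bias grows with n.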
Taxonomy
Topics: Traffic control and management
Methods: Sparse Evolutionary Training
