Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding
Jiamin Xu, Kyra Gan

TL;DR
This paper introduces a novel K-step lookahead thresholding method for non-episodic finite-horizon reinforcement learning, achieving fast convergence and superior empirical performance over existing tabular algorithms.
Contribution
It proposes a truncated K-step lookahead Q-function with a thresholding mechanism, along with an efficient tabular learning algorithm with proven finite-sample regret bounds, including minimax optimal constant regret for K=1.
Findings
Achieves minimax optimal constant regret for K=1.
Attains O(max(K-1, C_{K-1}) √(SAT log T)) regret for K≥2.
Demonstrates superior empirical rewards on synthetic and real RL environments.
Abstract
Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective and prove that it achieves fast finite-sample convergence: minimax optimal constant regret for K=1 and O(max(K-1, C_{K-1}) √(SAT log T)) regret for any K≥2…
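The two ideas in the abstract, truncating planning to the next K steps and selecting an action only when its lookahead value clears a threshold, can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes a known tabular model (`P`, `R`) rather than online estimates, computes the lookahead by truncated backward induction, and falls back to a hypothetical `default_action` when no action exceeds the threshold. All function names, the toy MDP, and the fallback rule are assumptions made for illustration.

```python
import numpy as np

def k_step_lookahead_q(P, R, K, t, T):
    """K-step lookahead Q-values at time t for a known tabular MDP (illustrative).

    P: array of shape (S, A, S), transition probabilities.
    R: array of shape (S, A), expected one-step rewards.
    Planning covers min(K, T - t) steps, so it never looks past the terminal time T.
    """
    S, A = R.shape
    steps = max(0, min(K, T - t))
    Q = np.zeros((S, A))
    V = np.zeros(S)  # value of "zero steps remaining"
    for _ in range(steps):
        Q = R + P @ V          # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)      # greedy value within the truncated window
    return Q

def threshold_select(q_row, threshold, default_action=0):
    """Pick the best action among those whose value exceeds the threshold.

    Falling back to default_action is an assumption of this sketch,
    not something stated in the source.
    """
    eligible = np.flatnonzero(q_row > threshold)
    if eligible.size == 0:
        return default_action
    return int(eligible[np.argmax(q_row[eligible])])

# Toy deterministic MDP (hypothetical): action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

Q2 = k_step_lookahead_q(P, R, K=2, t=0, T=10)
action = threshold_select(Q2[0], threshold=2.0)
```

A time-varying threshold would be passed in by the caller as a function of `t`; the schedule itself is left unspecified here because the source does not give it.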
Taxonomy
Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics
