Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding

Jiamin Xu; Kyra Gan

arXiv:2602.00781 · cs.LG · February 3, 2026

TL;DR

This paper introduces a K-step lookahead thresholding method for non-episodic, finite-horizon reinforcement learning, with fast finite-sample convergence guarantees and higher empirical rewards than existing tabular algorithms.

Contribution

It proposes a truncated K-step lookahead Q-function with a thresholding mechanism, together with an efficient tabular algorithm that has proven regret bounds, including minimax optimal constant regret for K = 1.

Findings

01. Achieves minimax optimal constant regret for K = 1.

02. Attains O(max(K-1, C_{K-1}) √(SAT log T)) regret for K ≥ 2.

03. Demonstrates superior empirical rewards on synthetic and real RL environments.

Abstract

Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective, proving it achieves fast finite-sample convergence: it achieves minimax optimal constant regret for $K = 1$ and $O(\max(K - 1, C_{K - 1}) \sqrt{SAT \log(T)})$ regret for any…
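To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the transition probabilities and expected rewards are already known, whereas the paper learns its estimates online from samples. It computes a K-step lookahead Q-function by K rounds of backward induction and then keeps only the actions whose value exceeds a threshold. The names `k_step_lookahead_q`, `actions_above_threshold`, `P`, and `R` are hypothetical, and the threshold is passed in as a plain number because the abstract does not say how it varies over time.

```python
from typing import List

# Hypothetical tabular model (not from the paper):
# P[s][a][s2] is a transition probability, R[s][a] an expected one-step reward.
Transition = List[List[List[float]]]
Reward = List[List[float]]


def k_step_lookahead_q(P: Transition, R: Reward, K: int) -> List[List[float]]:
    """Return q[s][a]: best expected total reward over the next K steps."""
    if K < 1:
        raise ValueError("K must be at least 1")
    n_states, n_actions = len(R), len(R[0])
    value = [0.0] * n_states  # value with zero steps of lookahead remaining
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(K):
        # One backup: immediate reward plus expected value of the shorter lookahead.
        q = [
            [
                R[s][a] + sum(P[s][a][s2] * value[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        value = [max(row) for row in q]
    return q


def actions_above_threshold(q_row: List[float], threshold: float) -> List[int]:
    """Indices of actions whose lookahead value strictly exceeds the threshold."""
    return [a for a, v in enumerate(q_row) if v > threshold]


# Toy 2-state, 2-action MDP: action 0 always leads to state 0, action 1 to state 1.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
R = [[0.5, 0.0], [0.0, 1.0]]

q1 = k_step_lookahead_q(P, R, K=1)  # [[0.5, 0.0], [0.0, 1.0]]
q3 = k_step_lookahead_q(P, R, K=3)  # [[1.5, 2.0], [1.0, 3.0]]
print(actions_above_threshold(q1[0], 0.25))  # [0]
print(actions_above_threshold(q3[0], 1.75))  # [1]
```

In this toy model, one step of lookahead prefers action 0 in state 0 because of its immediate reward, while three steps of lookahead prefer action 1 because it reaches the higher-reward state. The threshold values 0.25 and 1.75 are arbitrary choices for illustration.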

Taxonomy

Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics