Reinforcement Learning with Random Time Horizons
Enric Ribera Borrell, Lorenz Richter, Christof Schütte

TL;DR
This paper extends reinforcement learning to random, policy-dependent stopping times, derives the corresponding policy gradient formulas, and reports improved optimization convergence over traditional approaches in numerical experiments.
Contribution
It introduces a rigorous framework for RL with random time horizons, deriving policy gradient formulas that account for trajectory-dependent stopping times.
Findings
New policy gradient formulas for random horizons
Improved convergence in numerical experiments
Connections to optimal control theory
Abstract
We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes finite and deterministic or infinite runtimes of trajectories, we argue that multiple real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since those stopping times typically depend on the policy, their randomness has an effect on policy gradient formulas, which we (mostly for the first time) derive rigorously in this work both for stochastic and deterministic policies. We present two complementary perspectives, trajectory or state-space based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.
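To make concrete why a policy-dependent stopping time matters for gradient estimation, here is a minimal sketch in Python. It is not the formula derived in the paper. It uses the generic score-function (REINFORCE-style) estimator on a hypothetical toy problem: a random walk on the integers that starts at 0 and stops when it leaves the interval (-3, 3). The environment, the one-parameter sigmoid policy, the reward of -1 per step, the step cap, and all function names are assumptions made purely for illustration. The point is that the sum of score terms runs up to the random stopping time `tau`, which differs from trajectory to trajectory.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rollout(theta, rng, boundary=3, max_steps=10_000):
    """Simulate one trajectory until it exits (-boundary, boundary).

    Returns (tau, score), where tau is the random stopping time and score
    is the sum over t < tau of d/dtheta log pi_theta(a_t).
    max_steps is only a safeguard against non-terminating runs (an
    assumption of this sketch, not part of the source).
    """
    p_right = sigmoid(theta)  # probability of stepping +1
    state, tau, score = 0, 0, 0.0
    while abs(state) < boundary and tau < max_steps:
        if rng.random() < p_right:
            state += 1
            score += 1.0 - p_right  # d/dtheta log sigmoid(theta)
        else:
            state -= 1
            score += -p_right       # d/dtheta log(1 - sigmoid(theta))
        tau += 1
    return tau, score


def estimate_gradient(theta, n_trajectories=20_000, seed=0):
    """Monte Carlo score-function estimate of d/dtheta E[-tau].

    Each trajectory contributes (return) * (summed score), with both
    sums truncated at that trajectory's own stopping time.
    """
    rng = random.Random(seed)
    grad_sum, tau_sum = 0.0, 0
    for _ in range(n_trajectories):
        tau, score = rollout(theta, rng)
        ret = -float(tau)  # reward of -1 per step until stopping
        grad_sum += ret * score
        tau_sum += tau
    return grad_sum / n_trajectories, tau_sum / n_trajectories


grad, mean_tau = estimate_gradient(theta=1.0)
print(f"estimated gradient: {grad:.3f}, mean stopping time: {mean_tau:.3f}")
```

In this toy setting, a positive `theta` biases the walk to the right, so increasing it further shortens the expected exit time, and the estimated gradient of the objective (negative expected stopping time) should come out positive. The estimator has high variance without a baseline; the sketch omits one to keep the dependence on `tau` easy to see.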
Taxonomy
Topics: Reinforcement Learning in Robotics · Neural Networks and Applications
