Stochastic Decision Horizons for Constrained Reinforcement Learning
Nikola Milosevic, Leonard Franz, Daniel Haeufle, Georg Martius, Nico Scherf, Pavel Kolev

TL;DR
This paper formulates constrained reinforcement learning with stochastic decision horizons. Survival-weighted objectives and two violation semantics (absorbing and virtual termination) improve off-policy scalability and sample efficiency.
Contribution
The paper proposes a Control as Inference formulation based on stochastic decision horizons, with two violation semantics (absorbing and virtual termination), yielding survival-weighted objectives that stay replay-compatible for off-policy actor-critic constrained RL.
Findings
Enhanced sample efficiency on standard benchmarks.
Effective scaling to high-dimensional musculoskeletal tasks.
Distinct optimization structures for different violation semantics.
Abstract
Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of using additive-cost constraints and dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct optimization structures that lead to SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable…
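To make the idea of a survival-weighted return concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that each step t has a continuation probability c_t in [0, 1] that drops when a constraint is violated, and that the reward at step t is weighted by the product of the continuation probabilities of all earlier steps. The function names, the indexing convention, and the one-step target are illustrative assumptions; the abstract does not give the exact formulas.

```python
import numpy as np


def survival_weighted_return(rewards, continuations, gamma=0.99):
    """Discounted return where each reward is scaled by the survival weight.

    Assumed convention (not taken from the paper):
        survival[t] = prod_{k < t} continuations[k], with survival[0] = 1.
    """
    rewards = np.asarray(rewards, dtype=float)
    continuations = np.asarray(continuations, dtype=float)
    # Probability of still being "alive" when the reward at step t is collected.
    survival = np.concatenate(([1.0], np.cumprod(continuations[:-1])))
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * survival * rewards))


def td_target(reward, continuation, next_value, gamma=0.99):
    """One-step bootstrap target under the same assumed convention.

    It only needs (reward, continuation, next_value) for a single transition,
    which is why such a target can be computed from replayed transitions.
    """
    return reward + gamma * continuation * next_value


# Hypothetical three-step trajectory with a violation at step 1 (c_1 = 0.5).
rewards = [1.0, 1.0, 1.0]
continuations = [1.0, 0.5, 1.0]
gamma = 0.9

direct = survival_weighted_return(rewards, continuations, gamma)

# Unrolling the one-step target backwards reproduces the same value.
recursive = 0.0
for r, c in zip(reversed(rewards), reversed(continuations)):
    recursive = td_target(r, c, recursive, gamma)
```

Under this convention the violation at step 1 halves the weight of every later reward, which is the sense in which a lower continuation probability shortens the effective planning horizon.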
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Adaptive Dynamic Programming Control
