Explicit Explore, Exploit, or Escape ($E^4$): near-optimal safety-constrained reinforcement learning in polynomial time
David M. Bossens, Nicholas Bishop

TL;DR
This paper introduces $E^4$, a model-based reinforcement learning algorithm that ensures safety and near-optimal performance in constrained environments by explicitly managing exploration, exploitation, and escape strategies within polynomial time.
Contribution
The paper extends the $E^{3}$ algorithm to a robust constrained setting, providing a polynomial-time method for safe, near-optimal policy learning in unknown environments.
Findings
$E^4$ guarantees that safety constraints are satisfied during learning.
$E^4$ finds near-optimal policies in polynomial time.
The theoretical analysis supports both the robustness and the polynomial-time efficiency of the algorithm.
Abstract
In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^4$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, as well as safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with…
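The idea of optimising against the worst-case CMDP from a set of candidate models can be illustrated with a small sketch. This is not the $E^4$ algorithm from the paper: it is a brute-force toy that enumerates deterministic policies on a two-state problem, evaluates each under every candidate transition model, and keeps the policy with the best worst-case reward value among those whose worst-case constraint cost stays within a budget. All names (`policy_value`, `worst_case_evaluation`, `robust_feasible_best`), the discount factor, the budget, and the two models are assumptions made for illustration.

```python
import itertools
import numpy as np


def policy_value(P, signal, policy, gamma):
    """Discounted value of a deterministic policy.

    P[s, a, s'] is a transition tensor, signal[s, a] is a reward or
    constraint-cost table, policy[s] is the action taken in state s.
    """
    n = P.shape[0]
    idx = np.arange(n)
    policy = np.asarray(policy)
    P_pi = P[idx, policy]        # (n, n) transition matrix under the policy
    r_pi = signal[idx, policy]   # (n,) per-state signal under the policy
    # Solve (I - gamma * P_pi) v = r_pi for the value vector v.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


def worst_case_evaluation(models, reward, cost, policy, gamma, start):
    """Lowest reward value and highest constraint cost over all models."""
    values = [policy_value(P, reward, policy, gamma)[start] for P in models]
    costs = [policy_value(P, cost, policy, gamma)[start] for P in models]
    return min(values), max(costs)


def robust_feasible_best(models, reward, cost, policies, gamma, start, budget):
    """Best worst-case policy among those satisfying the budget in every model."""
    best = None
    for policy in policies:
        v, c = worst_case_evaluation(models, reward, cost, policy, gamma, start)
        if c <= budget and (best is None or v > best[1]):
            best = (policy, v, c)
    return best


# Hypothetical toy problem: state 1 is absorbing with zero reward and cost.
# Action 1 in state 0 pays more but incurs constraint cost; the two candidate
# models disagree on where it leads.
model_a = np.array([[[1.0, 0.0], [1.0, 0.0]],
                    [[0.0, 1.0], [0.0, 1.0]]])
model_b = np.array([[[1.0, 0.0], [0.0, 1.0]],
                    [[0.0, 1.0], [0.0, 1.0]]])
reward = np.array([[1.0, 2.0], [0.0, 0.0]])
cost = np.array([[0.0, 1.0], [0.0, 0.0]])
policies = list(itertools.product(range(2), repeat=2))

risky_value, risky_cost = worst_case_evaluation(
    [model_a, model_b], reward, cost, (1, 0), gamma=0.5, start=0)
best_policy, best_value, best_cost = robust_feasible_best(
    [model_a, model_b], reward, cost, policies, gamma=0.5, start=0, budget=1.5)
```

In this toy, the risky policy looks attractive under `model_a` alone, but its worst-case cost exceeds the budget, so the robust selection falls back to the safe action. Enumeration is exponential in the number of states and is used here only to keep the example short.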
Taxonomy
Topics: Reinforcement Learning in Robotics
