Exploration-Exploitation in Constrained MDPs
Yonathan Efroni, Shie Mannor, Matteo Pirotta

TL;DR
This paper studies the exploration-exploitation trade-off in constrained Markov Decision Processes (CMDPs), proposing two approaches that achieve sublinear regret in reward and constraint violations, with the linear programming method offering stronger guarantees.
Contribution
It introduces and compares two novel algorithms for learning in CMDPs, analyzing their regret bounds and highlighting the advantages of the linear programming approach over the dual formulation.
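For readers unfamiliar with the linear programming view, the following is a sketch of the standard occupancy-measure linear program for an episodic CMDP. The notation is an assumption made for illustration and is not taken from the paper: horizon H, initial state distribution \mu, transition kernel p_h, reward r_h, constraint utilities d_{i,h} with thresholds \alpha_i, and occupancy measure q_h(s,a), the probability of visiting state s and taking action a at step h.

```latex
\begin{aligned}
\max_{q \ge 0} \quad & \sum_{h=1}^{H} \sum_{s,a} q_h(s,a)\, r_h(s,a) \\
\text{s.t.} \quad & \sum_{h=1}^{H} \sum_{s,a} q_h(s,a)\, d_{i,h}(s,a) \le \alpha_i, \qquad i = 1, \dots, I, \\
& \sum_{a} q_1(s,a) = \mu(s) \qquad \forall s, \\
& \sum_{a} q_{h+1}(s',a) = \sum_{s,a} p_h(s' \mid s,a)\, q_h(s,a) \qquad \forall s',\ h = 1, \dots, H-1.
\end{aligned}
```

A policy can be read off a feasible solution as \pi_h(a \mid s) = q_h(s,a) / \sum_{a'} q_h(s,a') wherever the denominator is positive. When the model is unknown, an optimistic planner would presumably solve a program of this shape with the unknown quantities replaced by optimistic estimates; the exact construction is specific to the paper and is not reproduced here.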
Findings
Both approaches achieve sublinear regret with respect to both the reward and the constraint violations.
The linear programming approach provides stronger theoretical guarantees.
The dual formulation approach allows for incremental updates of primal and dual variables.
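To illustrate what incremental primal and dual updates can look like, here is a minimal sketch on a toy one-step problem with a known model. It is not the paper's algorithm: there is no exploration or optimism, and the rewards, costs, threshold, step size, and all function names are hypothetical values chosen for illustration. The primal step picks the action that maximizes the Lagrangian, and the dual step is a projected subgradient update on the multiplier.

```python
# Toy one-step constrained problem (hypothetical numbers, known model).
# Action 0 has high reward but incurs cost; action 1 is safe but worse.
REWARDS = [1.0, 0.5]
COSTS = [1.0, 0.0]
THRESHOLD = 0.5  # constraint: average cost must not exceed this value


def best_response(lam):
    """Primal step: pick the action maximizing reward - lam * cost."""
    scores = [r - lam * c for r, c in zip(REWARDS, COSTS)]
    return max(range(len(scores)), key=scores.__getitem__)


def dual_step(lam, cost, step):
    """Dual step: projected subgradient update keeping lam >= 0."""
    return max(0.0, lam + step * (cost - THRESHOLD))


def run_primal_dual(num_iters=1000, step=0.1):
    """Alternate primal and dual updates; return averaged reward and cost."""
    lam = 0.0
    total_reward = 0.0
    total_cost = 0.0
    for _ in range(num_iters):
        action = best_response(lam)
        total_reward += REWARDS[action]
        total_cost += COSTS[action]
        lam = dual_step(lam, COSTS[action], step)
    return total_reward / num_iters, total_cost / num_iters, lam


avg_reward, avg_cost, final_lam = run_primal_dual()
print(avg_reward, avg_cost, final_lam)
```

In this toy instance the multiplier settles near the value at which both actions look equally good, so the played actions alternate and the averaged cost stays close to the threshold. Individual iterates still violate the constraint, which hints at why guarantees for a dual scheme may be weaker than those for solving the constrained program directly.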
Abstract
In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade off exploration, to discover new information about the MDP, against exploitation of its current knowledge, to maximize the reward while satisfying the constraints. While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process. In this work, we analyze two approaches for learning in CMDPs. The first approach leverages the linear formulation of CMDP to perform optimistic planning at each episode. The second approach leverages the dual…
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
