A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes
Honghao Wei, Xin Liu, Lei Ying

TL;DR
This paper introduces Triple-Q, a model-free reinforcement learning algorithm for Constrained Markov Decision Processes that achieves sublinear regret and zero constraint violation without requiring a model or simulator.
Contribution
The paper proposes the first provably-efficient, model-free algorithm for CMDPs with theoretical regret bounds and guaranteed zero constraint violations.
Findings
Achieves $\tilde{\cal O}\left(\frac{1}{\delta}H^4 S^{1/2}A^{1/2}K^{4/5}\right)$ regret.
Guarantees zero constraint violation with high probability.
Computational complexity similar to SARSA for unconstrained MDPs.
Abstract
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on the pseudo-Q-value that is a combination of the three "Q" values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves $\tilde{\cal O}\left(\frac{1 }{\delta}H^4…
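The action-selection and update steps described in the abstract can be sketched roughly as follows. This is a minimal tabular illustration, not the authors' implementation. The abstract only says the pseudo-Q-value is "a combination" of the three quantities, so the linear form `q_reward + (virtual_queue / eta) * q_utility`, the scaling parameter `eta`, the step-size formula in `learning_rate`, and all function names here are assumptions made for illustration. Exploration bonuses, the per-step index within an episode, and the periodic resets are omitted.

```python
import numpy as np


def select_action(q_reward, q_utility, virtual_queue, eta, state):
    """Pick the greedy action under an assumed linear pseudo-Q-value.

    q_reward, q_utility: arrays of shape (num_states, num_actions).
    virtual_queue: non-negative scalar tracking estimated constraint violation.
    eta: hypothetical positive scaling parameter (an assumption of this sketch).
    """
    pseudo_q = q_reward[state] + (virtual_queue / eta) * q_utility[state]
    return int(np.argmax(pseudo_q))


def learning_rate(visit_count, chi=1.0):
    """Hypothetical step size that shrinks as the visit count grows."""
    return (chi + 1.0) / (chi + visit_count)


def q_update(q, state, action, target, visit_count, chi=1.0):
    """Move q[state, action] toward target using the count-based step size."""
    alpha = learning_rate(visit_count, chi)
    q[state, action] = (1.0 - alpha) * q[state, action] + alpha * target
    return q


# Usage: one state, two actions. Action 0 has higher reward,
# action 1 has higher utility for the constraint.
q_reward = np.array([[1.0, 0.5]])
q_utility = np.array([[0.0, 1.0]])

# Empty virtual queue: the reward Q-value alone decides.
action_when_safe = select_action(q_reward, q_utility, 0.0, 1.0, 0)

# Large virtual queue: the utility Q-value dominates the choice.
action_when_violating = select_action(q_reward, q_utility, 2.0, 1.0, 0)
```

In this sketch, a growing virtual queue shifts weight from reward toward constraint utility, which is the qualitative behavior the abstract attributes to combining the three "Q" values.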
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Smart Grid Energy Management
Methods: SARSA
