A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

Honghao Wei; Xin Liu; Lei Ying

arXiv:2106.01577 · cs.LG · October 26, 2021 · 6 cites

TL;DR

This paper introduces Triple-Q, a model-free reinforcement learning algorithm for Constrained Markov Decision Processes that achieves sublinear regret and zero constraint violation without requiring a model or simulator.

Contribution

The paper proposes the first provably-efficient, model-free algorithm for CMDPs with theoretical regret bounds and guaranteed zero constraint violations.

Findings

01

Achieves $\tilde{\mathcal{O}}\left(\frac{1}{\delta}H^4 S^{1/2}A^{1/2}K^{4/5}\right)$ regret.

02

Guarantees zero constraint violation with high probability.

03

Computational complexity similar to SARSA for unconstrained MDPs.

Abstract

This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on the pseudo-Q-value that is a combination of the three "Q" values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves $\tilde{\cal O}\left(\frac{1 }{\delta}H^4…
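The three components described in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the class name `TripleQSketch`, the parameters `eta` and `chi`, the additive combination `Q + (Z / eta) * C`, the learning-rate form `(chi + 1) / (chi + t)`, and the queue update are all assumptions chosen for illustration, and any exploration bonus terms are omitted.

```python
import numpy as np


class TripleQSketch:
    """Illustrative tabular agent with a reward Q-table, a utility Q-table,
    and a scalar virtual queue. All names and update forms are assumptions."""

    def __init__(self, n_states, n_actions, horizon, eta=1.0, chi=1.0):
        self.H = horizon
        self.eta = eta  # hypothetical weight scaling the virtual queue
        self.chi = chi  # hypothetical learning-rate parameter
        shape = (horizon, n_states, n_actions)
        self.Q = np.full(shape, float(horizon))  # reward Q-values (optimistic init)
        self.C = np.full(shape, float(horizon))  # utility Q-values (optimistic init)
        self.N = np.zeros(shape, dtype=int)      # visit counts per (step, state, action)
        self.Z = 0.0                             # virtual queue

    def act(self, h, s):
        # Pseudo-Q-value: one assumed way to combine the three "Q" values.
        pseudo_q = self.Q[h, s] + (self.Z / self.eta) * self.C[h, s]
        return int(np.argmax(pseudo_q))

    def update(self, h, s, a, reward, utility, s_next):
        # The learning rate depends on the visit count of (h, s, a).
        self.N[h, s, a] += 1
        t = self.N[h, s, a]
        alpha = (self.chi + 1.0) / (self.chi + t)
        if h + 1 < self.H:
            a_next = self.act(h + 1, s_next)
            v_r = self.Q[h + 1, s_next, a_next]
            v_c = self.C[h + 1, s_next, a_next]
        else:
            v_r = v_c = 0.0  # no value beyond the final step of an episode
        self.Q[h, s, a] = (1 - alpha) * self.Q[h, s, a] + alpha * (reward + v_r)
        self.C[h, s, a] = (1 - alpha) * self.C[h, s, a] + alpha * (utility + v_c)

    def update_queue(self, rho, avg_utility):
        # The queue grows when observed utility falls short of the target rho.
        self.Z = max(self.Z + rho - avg_utility, 0.0)

    def reset_counts(self):
        # Periodic reset: zeroing the counts restarts the learning-rate schedule.
        self.N[:] = 0
```

In this sketch a larger virtual queue shifts action selection toward actions with higher estimated utility, which is how a queue that over-estimates violation can steer the policy back toward satisfying the constraint.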

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Smart Grid Energy Management

Methods: SARSA