Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints

Qinbo Bai; Vaneet Aggarwal; Ather Gattami

arXiv:2006.05961 · cs.LG · February 2, 2021

TL;DR

This paper introduces a model-free algorithm for constrained Markov Decision Processes (CMDPs) with long-term constraints and provides the first regret analysis of a model-free approach in this setting, where the transition probabilities are unknown.

Contribution

It proposes an algorithm that combines ideas from constrained optimization with Q-learning to handle MDPs with long-term constraints when the transition model is unknown.
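The page does not say how the two ingredients are combined, so the sketch below is not the authors' method. It is a generic Lagrangian (primal-dual) loop written under stated assumptions: a hypothetical one-state, two-action instance with made-up rewards and costs, a budget of 0.5 on the average cost, and illustrative step sizes (`alpha`, `eta`, `eps` are names chosen here). With a single state there is no next state, so the Q-learning update collapses to an incremental average of each action's reward and cost, while the multiplier `lam` is adjusted by projected dual ascent.

```python
import random

# Hypothetical one-state, two-action instance (illustrative numbers only):
# action 0 earns reward 1 at cost 1; action 1 earns 0 at cost 0.
REWARD = [1.0, 0.0]
COST = [1.0, 0.0]
BUDGET = 0.5  # long-term average cost must stay <= BUDGET


def primal_dual(steps=20000, alpha=0.1, eta=0.01, eps=0.05, seed=0):
    """Generic primal-dual sketch, not the paper's algorithm."""
    rng = random.Random(seed)
    q_r = [0.0, 0.0]  # running estimates of per-action reward
    q_c = [0.0, 0.0]  # running estimates of per-action cost
    lam = 0.0         # Lagrange multiplier for the cost constraint
    total_r = total_c = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(2)  # uniform exploration
        else:
            # Greedy w.r.t. the Lagrangian payoff r - lam * c (ties -> lowest index).
            payoff = [q_r[i] - lam * q_c[i] for i in range(2)]
            a = max(range(2), key=lambda i: payoff[i])
        r, c = REWARD[a], COST[a]
        # One-state Q-learning update: no next state, so it is an incremental average.
        q_r[a] += alpha * (r - q_r[a])
        q_c[a] += alpha * (c - q_c[a])
        # Projected dual ascent: raise lam when the budget is exceeded, keep lam >= 0.
        lam = max(0.0, lam + eta * (c - BUDGET))
        total_r += r
        total_c += c
    return total_r / steps, total_c / steps, lam


avg_r, avg_c, lam = primal_dual()
print(f"average reward {avg_r:.3f}, average cost {avg_c:.3f}, lambda {lam:.3f}")
```

For any fixed `lam` the greedy choice is deterministic, but because `lam` keeps oscillating around the value where both actions look equally attractive, the time-averaged behavior mixes the two actions and the average cost settles near the budget. This is one common way constrained-optimization ideas are paired with value learning; whether the paper uses anything like it is not stated here.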

Findings

01

Achieves $O(T^{1/2+\gamma})$ regret for reward maximization, for any $\gamma \in (0, \frac{1}{2})$.

02

Achieves $O(T^{1-\gamma/2})$ regret for constraint violation.

03

First regret bounds for a model-free algorithm in MDPs with long-term constraints.
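Simple arithmetic on the two quoted bounds makes the role of $\gamma$ concrete. The helpers below (hypothetical names, not from the paper) return both exponents for a given $\gamma$ and solve $\frac{1}{2} + \gamma = 1 - \frac{\gamma}{2}$, which gives $\gamma = \frac{1}{3}$ and a common rate of $T^{5/6}$. This balancing point is derived here from the stated exponents; the abstract itself does not single it out.

```python
from fractions import Fraction


def regret_exponents(gamma):
    """Exponents of the bounds O(T^{1/2 + gamma}) (reward) and O(T^{1 - gamma/2}) (violation)."""
    if not (0 < gamma < Fraction(1, 2)):
        raise ValueError("gamma must lie in (0, 1/2)")
    return Fraction(1, 2) + gamma, 1 - gamma / 2


def balanced_gamma():
    """Solve 1/2 + gamma = 1 - gamma/2, i.e. gamma * (1 + 1/2) = 1 - 1/2."""
    return (1 - Fraction(1, 2)) / (1 + Fraction(1, 2))


for g in [Fraction(1, 10), Fraction(1, 4), balanced_gamma(), Fraction(9, 20)]:
    reward_exp, violation_exp = regret_exponents(g)
    print(f"gamma={g}: reward T^({reward_exp}), violation T^({violation_exp})")
```

Smaller $\gamma$ tightens the reward bound at the expense of the constraint-violation bound, and larger $\gamma$ does the reverse; both exponents stay strictly below 1 on the whole interval, so both regrets are sublinear in $T$.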

Abstract

In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward as well as satisfies the average constraints in each episode. The key challenge with the long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints. For any $γ \in (0, \frac{1}{2})$, the proposed algorithm is shown to achieve $O (T^{1/2 + γ})$ regret bound for the obtained…
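A tiny example shows why the optimal policy of a CMDP need not be deterministic. Assume, purely for illustration, a single state with two actions: action "a" earns reward 1 at cost 1, action "b" earns nothing at no cost, and the long-term average cost may not exceed 1/2. The sketch enumerates the two deterministic policies and a grid of randomized ones; all names and numbers are hypothetical.

```python
from fractions import Fraction

# Hypothetical one-state, two-action CMDP; numbers are illustrative only.
REWARD = {"a": Fraction(1), "b": Fraction(0)}
COST = {"a": Fraction(1), "b": Fraction(0)}
BUDGET = Fraction(1, 2)  # long-term average cost must not exceed this


def value(p):
    """Average (reward, cost) when action "a" is played with probability p."""
    reward = p * REWARD["a"] + (1 - p) * REWARD["b"]
    cost = p * COST["a"] + (1 - p) * COST["b"]
    return reward, cost


def best_reward(probabilities):
    """Largest average reward among the given policies that meet the budget."""
    feasible = [value(p) for p in probabilities if value(p)[1] <= BUDGET]
    return max(r for r, _ in feasible)


deterministic = [Fraction(0), Fraction(1)]           # always "b" or always "a"
randomized = [Fraction(k, 100) for k in range(101)]  # grid over mixing probabilities

print("best deterministic:", best_reward(deterministic))  # 0
print("best randomized:", best_reward(randomized))        # 1/2
```

The only feasible deterministic policy always plays "b" and earns 0, while playing "a" half the time meets the budget exactly and earns 1/2. A greedy policy read off a single Q-table picks one action per state, which is why, as the abstract notes, standard Q-learning cannot be applied directly.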

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms

Methods: Q-Learning