Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints
Qinbo Bai, Vaneet Aggarwal, Ather Gattami

TL;DR
This paper introduces a model-free algorithm for constrained Markov Decision Processes with long-term constraints, providing the first regret analysis in the setting where transition probabilities are unknown.
Contribution
It proposes a novel algorithm combining constrained optimization and Q-learning for long-term constrained MDPs without known transition models.
Findings
Achieves $O(T^{1/2+\eta})$ regret for reward maximization.
Achieves $O(T^{1-\eta/2})$ regret for constraint violation.
First regret bounds for model-free long-term constrained MDPs.
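As a rough illustration of how the two exponents trade off, one can ask which $\eta$ makes both bounds grow at the same rate. This is simple arithmetic on the exponents listed above, not a claim taken from the paper, and it assumes $\eta = 1/3$ lies in the admissible range, which this summary does not state.

$$
\frac{1}{2} + \eta = 1 - \frac{\eta}{2}
\;\Longrightarrow\;
\frac{3\eta}{2} = \frac{1}{2}
\;\Longrightarrow\;
\eta = \frac{1}{3},
\qquad
\frac{1}{2} + \frac{1}{3} = 1 - \frac{1}{6} = \frac{5}{6}.
$$

Under that assumption both the reward regret and the constraint violation scale as $O(T^{5/6})$. A smaller $\eta$ tightens the reward bound at the expense of the constraint bound, and a larger $\eta$ does the reverse.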
Abstract
In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward while satisfying the average constraints in each episode. The key challenge with the long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDP with long-term constraints. For any $\eta$, the proposed algorithm is shown to achieve an $O(T^{1/2+\eta})$ regret bound for the obtained…
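The claim that the optimal policy is not deterministic in general can be seen in a toy problem. The sketch below is a hypothetical one-state CMDP with two actions; the rewards, costs, budget, and function names are illustrative assumptions, not taken from the paper, and the grid search stands in for a proper solver. It is not the paper's algorithm.

```python
# Hypothetical one-state CMDP: all names and numbers are illustrative,
# not taken from the paper.
REWARD = {"a": 1.0, "b": 0.0}   # per-step reward of each action
COST = {"a": 1.0, "b": 0.0}     # per-step constraint cost of each action
BUDGET = 0.5                    # long-term average cost must stay <= BUDGET


def evaluate(p_a):
    """Return (average reward, average cost) when 'a' is played with probability p_a."""
    avg_reward = p_a * REWARD["a"] + (1.0 - p_a) * REWARD["b"]
    avg_cost = p_a * COST["a"] + (1.0 - p_a) * COST["b"]
    return avg_reward, avg_cost


def best_policy(candidates):
    """Pick the feasible candidate with the highest average reward."""
    feasible = [p for p in candidates if evaluate(p)[1] <= BUDGET]
    return max(feasible, key=lambda p: evaluate(p)[0])


deterministic = [0.0, 1.0]                   # always 'b' or always 'a'
stochastic = [i / 100 for i in range(101)]   # grid over mixing probabilities

best_det = best_policy(deterministic)
best_sto = best_policy(stochastic)
print("best deterministic:", best_det, evaluate(best_det))
print("best stochastic:   ", best_sto, evaluate(best_sto))
```

In this toy setup the only feasible deterministic policy always plays the zero-reward action, while mixing the two actions spends the full budget and earns a strictly higher average reward. A greedy policy derived from a single Q-table always commits to one action per state, which is why it cannot represent the mixed solution.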
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
Methods: Q-Learning
