Near-Optimal Sample Complexity for Online Constrained MDPs
Chang Liu, Yunfan Li, Lin F. Yang

TL;DR
This paper introduces a model-based primal-dual algorithm for online constrained MDPs that achieves near-optimal sample complexity, effectively balancing safety constraints and learning efficiency in both relaxed and strict feasibility settings.
Contribution
It provides the first near-optimal sample complexity bounds for online constrained MDPs, matching lower bounds and handling both relaxed and strict safety constraints.
Findings
Achieves $\tilde{O}(SAH^3/\epsilon^2)$ episodes for relaxed feasibility.
Achieves $\tilde{O}(SAH^5/(\epsilon^2\zeta^2))$ episodes for strict feasibility.
Shows that CMDPs can be learned as efficiently as unconstrained MDPs when small violations are allowed.
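To make the gap between the two bounds concrete, the sketch below evaluates both expressions with constants and logarithmic factors dropped. It assumes the usual reading of the symbols, which this page does not define: `S` states, `A` actions, horizon `H`, target accuracy `eps`, and a feasibility margin `zeta`. The function names are hypothetical and the numbers are illustrative only.

```python
def relaxed_episodes(S, A, H, eps):
    """Leading term S*A*H^3 / eps^2 (constants and log factors ignored)."""
    return S * A * H**3 / eps**2


def strict_episodes(S, A, H, eps, zeta):
    """Leading term S*A*H^5 / (eps^2 * zeta^2) (constants and log factors ignored)."""
    return S * A * H**5 / (eps**2 * zeta**2)


def strict_to_relaxed_ratio(S, A, H, eps, zeta):
    """Multiplicative overhead of strict over relaxed feasibility."""
    return strict_episodes(S, A, H, eps, zeta) / relaxed_episodes(S, A, H, eps)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    S, A, H, eps, zeta = 10, 5, 20, 0.1, 0.5
    print("relaxed:", relaxed_episodes(S, A, H, eps))
    print("strict: ", strict_episodes(S, A, H, eps, zeta))
    print("ratio:  ", strict_to_relaxed_ratio(S, A, H, eps, zeta))
```

Under these assumptions the ratio simplifies to $H^2/\zeta^2$, so the extra cost of strict feasibility grows with the horizon and shrinks as the margin $\zeta$ increases.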
Abstract
Safety is a fundamental challenge in reinforcement learning (RL), particularly in real-world applications such as autonomous driving, robotics, and healthcare. To address this, Constrained Markov Decision Processes (CMDPs) are commonly used to enforce safety constraints while optimizing performance. However, existing methods often suffer from significant safety violations or require a high sample complexity to generate near-optimal policies. We address two settings: relaxed feasibility, where small violations are allowed, and strict feasibility, where no violation is allowed. We propose a model-based primal-dual algorithm that balances regret and bounded constraint violations, drawing on techniques from online RL and constrained optimization. For relaxed feasibility, we prove that our algorithm returns an $\epsilon$-optimal policy with $\epsilon$-bounded violation with arbitrarily…
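For readers unfamiliar with primal-dual methods, the following is a minimal sketch of a generic Lagrangian scheme on a tiny tabular finite-horizon CMDP. It is not the authors' algorithm: it assumes the transition model `P` is known (the paper learns it online), writes the constraint as "expected cumulative utility at least `b`", and uses hypothetical names (`best_response`, `evaluate`, `primal_dual`) and a toy environment chosen for illustration.

```python
def best_response(P, r, u, lam, H):
    """Primal step: greedy policy for the scalarized reward r + lam * u."""
    S, A = len(r), len(r[0])
    V = [0.0] * S
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):  # backward induction over the horizon
        new_V = [0.0] * S
        for s in range(S):
            q = [r[s][a] + lam * u[s][a]
                 + sum(P[s][a][t] * V[t] for t in range(S))
                 for a in range(A)]
            best = max(range(A), key=lambda a: q[a])
            policy[h][s] = best
            new_V[s] = q[best]
        V = new_V
    return policy


def evaluate(P, signal, policy, H, s0=0):
    """Expected cumulative `signal` (reward or utility) of a policy from state s0."""
    S = len(signal)
    V = [0.0] * S
    for h in reversed(range(H)):
        V = [signal[s][policy[h][s]]
             + sum(P[s][policy[h][s]][t] * V[t] for t in range(S))
             for s in range(S)]
    return V[s0]


def primal_dual(P, r, u, b, H, steps=200, eta=0.1):
    """Alternate best responses with projected dual updates on the multiplier."""
    lam, history = 0.0, []
    for _ in range(steps):
        pi = best_response(P, r, u, lam, H)
        v_r = evaluate(P, r, pi, H)
        v_u = evaluate(P, u, pi, H)
        history.append((v_r, v_u, lam))
        # Dual step: raise lam when the constraint v_u >= b is violated.
        lam = max(0.0, lam - eta * (v_u - b))
    return history


# Toy CMDP (assumed for illustration): two absorbing states, two actions.
# Action 0 earns reward but no utility; action 1 earns utility but no reward.
P = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
r = [[1.0, 0.0], [1.0, 0.0]]
u = [[0.0, 1.0], [0.0, 1.0]]
history = primal_dual(P, r, u, b=2.0, H=3)
```

Because each primal step is a deterministic best response, individual iterates switch between the two extreme policies; it is the average over iterates that approximately meets the constraint, which is the standard behaviour of this kind of scheme.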
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Advanced Bandit Algorithms Research
