Concave Utility Reinforcement Learning with Zero-Constraint Violations
Mridul Agarwal, Qinbo Bai, Vaneet Aggarwal

TL;DR
This paper introduces a model-based reinforcement learning algorithm for concave utility optimization with convex constraints, ensuring zero constraint violations and providing regret guarantees in tabular infinite-horizon settings.
Contribution
It proposes a novel optimization approach that guarantees zero constraint violations and offers regret bounds, improving computational efficiency for constrained reinforcement learning.
Findings
Achieves zero constraint violations in reinforcement learning.
Provides a high-probability regret bound of order Õ(1/√T) over T interactions, excluding other factors.
Applicable to both optimistic and posterior sampling algorithms.
Abstract
We consider the problem of tabular infinite-horizon concave utility reinforcement learning (CURL) with convex constraints. For this, we propose a model-based learning algorithm that also achieves zero constraint violations. Assuming that the concave objective and the convex constraints have a solution interior to the set of feasible occupation measures, we solve a tighter optimization problem to ensure that the constraints are never violated despite the imprecise model knowledge and model stochasticity. We use Bellman error-based analysis for tabular infinite-horizon setups, which allows analyzing stochastic policies. Combining the Bellman error-based analysis and the tighter optimization equation, for T interactions with the environment, we obtain a high-probability regret guarantee for the objective which grows as Õ(1/√T), excluding other factors. The proposed method can be…
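The idea of solving a tighter optimization problem can be illustrated with a deliberately small sketch. It is not the authors' algorithm: it assumes a single-state problem in which a policy is just a mixing weight p on a costly action, a square-root utility stands in for the concave objective, and the estimated per-unit cost c_hat is within eps of the true cost. The function name solve_tightened and all numbers are hypothetical. Shrinking the budget by eps keeps the true cost within the budget, because p * c_true <= p * c_hat + p * eps <= (budget - eps) + eps.

```python
import math


def solve_tightened(c_hat, budget, eps, grid=1001):
    """Grid-search the mixing weight p in [0, 1] that maximises a concave
    utility subject to the tightened constraint p * c_hat <= budget - eps.

    Toy stand-in (assumption): the utility is sqrt(p), and c_hat is an
    estimate of the true per-unit cost with error at most eps.
    """
    best_p, best_u = 0.0, -math.inf
    for i in range(grid):
        p = i / (grid - 1)
        if p * c_hat > budget - eps:
            continue  # infeasible under the tightened constraint
        u = math.sqrt(p)  # concave, increasing utility of the weight
        if u > best_u:
            best_p, best_u = p, u
    return best_p


# Hypothetical numbers: the true cost is 1.0 but the model estimates 0.9.
c_true, c_hat, budget, eps = 1.0, 0.9, 0.5, 0.1

p_naive = solve_tightened(c_hat, budget, eps=0.0)  # trusts the estimate
p_tight = solve_tightened(c_hat, budget, eps=eps)  # leaves a safety margin

print("naive true cost:", p_naive * c_true)      # exceeds the budget
print("tightened true cost:", p_tight * c_true)  # stays within the budget
```

The safety margin costs some utility, since the tightened solution uses a smaller p. In this toy the margin is a fixed eps; how the margin is chosen in the paper's setting is not stated in the excerpt above.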
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Risk and Portfolio Optimization
