Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning
Yichen Chen, Mengdi Wang

TL;DR
This paper introduces stochastic primal-dual methods for online policy estimation in Markov decision processes, achieving near-optimal sample complexity with low computational cost.
Contribution
It proposes a novel class of Stochastic Primal-Dual (SPD) algorithms that exploit Bellman duality, with proven sample complexity bounds for both infinite-horizon and finite-horizon MDPs.
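For orientation, the duality in question can be written out using the standard linear-programming view of the discounted Bellman equation. The notation here is an assumption chosen for illustration and may differ from the paper's: $\mathcal{S}$ and $\mathcal{A}$ are the state and action sets, $P_a$ and $r_a$ are the transition matrix and reward vector under action $a$, $\gamma$ is the discount factor, and $\xi$ is an arbitrary positive weight vector.

```latex
% Bellman equation (discounted reward, \gamma \in (0,1))
v^*(i) = \max_{a \in \mathcal{A}} \Big\{ r_a(i) + \gamma \sum_{j \in \mathcal{S}} P_a(i,j)\, v^*(j) \Big\}, \quad i \in \mathcal{S}

% Equivalent linear program, for any weight vector \xi > 0
\min_{v} \; \xi^\top v \quad \text{s.t.} \quad (I - \gamma P_a)\, v - r_a \ge 0 \quad \forall a \in \mathcal{A}

% Saddle-point (minimax) form obtained from the Lagrangian
\min_{v} \max_{\lambda \ge 0} \; L(v, \lambda) = \xi^\top v + \sum_{a \in \mathcal{A}} \lambda_a^\top \big( (\gamma P_a - I)\, v + r_a \big)
```

In this form the primal variable $v$ plays the role of the value estimate, while the dual variable $\lambda$ weights state-action pairs and can be read as an unnormalized policy estimate.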
Findings
Achieves an absolute-$\epsilon$-optimal policy with high probability.
Provides sample complexity bounds that depend on the sizes of the state and action spaces and on the discount factor.
Low per-iteration computational complexity.
Abstract
We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods that exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates as each new state transition is observed. These methods use little storage and have low computational complexity per iteration. The SPD methods find an absolute-$\epsilon$-optimal policy, with high probability, using a number of iterations/samples bounded in terms of the sizes of the state and action spaces and the discount factor for the infinite-horizon discounted-reward MDP, with a corresponding bound for the finite-horizon MDP.
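The coordinate-wise update described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a simplified stochastic gradient descent-ascent on the standard Lagrangian of the discounted Bellman linear program, and it assumes uniform sampling of state-action pairs, a uniform weight vector, rewards in [0, 1], and simple box projections. The function name `spd_sketch` and all parameter names are hypothetical.

```python
import random


def spd_sketch(P, r, gamma, n_iters=5000, alpha=0.01, beta=0.01, seed=0):
    """Simplified stochastic primal-dual iteration (illustrative only).

    P[a][i][j]: probability of moving from state i to j under action a.
    r[a][i]:    reward in [0, 1] for taking action a in state i.
    Saddle-point problem, with xi taken to be uniform (an assumption):
        min_v max_{lam >= 0}  xi^T v + sum_a lam_a^T ((gamma P_a - I) v + r_a)
    """
    rng = random.Random(seed)
    A, S = len(P), len(P[0])
    # With rewards in [0, 1], optimal values lie in [0, 1/(1-gamma)];
    # the same box also contains the dual solution when xi sums to 1.
    bound = 1.0 / (1.0 - gamma)
    v = [0.0] * S                        # value estimate (primal variable)
    lam = [[0.0] * S for _ in range(A)]  # policy-related dual variable
    scale = S * A                        # importance weight for uniform (i, a)

    for _ in range(n_iters):
        # Observe one transition from a uniformly sampled state-action pair.
        i = rng.randrange(S)
        a = rng.randrange(A)
        j = rng.choices(range(S), weights=P[a][i])[0]
        k = rng.randrange(S)             # sample k ~ xi (uniform)

        # Unbiased partial derivatives, both evaluated at the old iterates.
        g_lam = scale * (gamma * v[j] - v[i] + r[a][i])
        w = scale * lam[a][i]

        # Primal descent: gradient estimate is e_k + w * (gamma e_j - e_i),
        # so at most three coordinates of v change.
        v[k] -= alpha
        v[j] -= alpha * gamma * w
        v[i] += alpha * w
        # Dual ascent on the single sampled coordinate.
        lam[a][i] += beta * g_lam

        # Project the touched coordinates back onto the boxes.
        for s in (i, j, k):
            v[s] = min(max(v[s], 0.0), bound)
        lam[a][i] = min(max(lam[a][i], 0.0), bound)

    # Read off a deterministic policy: the action with the largest dual weight.
    policy = [max(range(A), key=lambda act: lam[act][s]) for s in range(S)]
    return v, lam, policy
```

Each iteration touches at most three entries of `v` and one entry of `lam`, which is the sense in which storage and per-iteration cost stay small. Step sizes, the sampling scheme, and the projection sets in the paper's analysis may differ from the choices made here.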
Taxonomy
Topics: Reinforcement Learning in Robotics · Neural Networks and Applications · Advanced Bandit Algorithms Research
