Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning

Yichen Chen; Mengdi Wang

arXiv:1612.02516 · stat.ML · December 9, 2016 · 43 cites

Open Access

TL;DR

This paper introduces stochastic primal-dual methods for online estimation of the optimal policy of a Markov decision process, with proven sample complexity bounds and low per-iteration computational cost.

Contribution

It proposes a class of Stochastic Primal-Dual (SPD) algorithms that exploit the minimax duality of the Bellman equations, with proven sample complexity bounds for both the infinite-horizon discounted-reward MDP and the finite-horizon MDP.
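
The minimax duality mentioned here can be made concrete with the standard linear-programming form of the discounted Bellman equation. The sketch below uses notation that does not appear on this page and is assumed for illustration: $\xi$ is a positive weight vector over states, $P_a$ and $r_a$ are the transition matrix and reward vector under action $a$, $v$ is the value vector, and $\lambda_a \ge 0$ are the dual variables attached to the Bellman inequalities. The paper's exact formulation (scaling, constraint sets) may differ.

```latex
\begin{align*}
% Primal linear program: the optimal value function is the smallest v
% that dominates every one-step Bellman backup.
&\min_{v \in \mathbb{R}^{|\mathcal{S}|}} \ \xi^\top v
  \quad \text{s.t.} \quad v \ge r_a + \gamma P_a v \quad \forall a \in \mathcal{A} \\[4pt]
% Equivalent saddle-point (minimax) problem obtained from the Lagrangian.
&\min_{v} \ \max_{\lambda \ge 0} \ \
  \xi^\top v + \sum_{a \in \mathcal{A}} \lambda_a^\top \left( r_a + \gamma P_a v - v \right)
\end{align*}
```

In this form the primal variable plays the role of the value estimate and the dual variable, once normalized, plays the role of a policy estimate, which is consistent with the abstract's description of updating both as transitions are observed.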

Findings

1. Finds an absolute-$\epsilon$-optimal policy with high probability.

2. Provides sample complexity bounds that depend on the sizes of the state and action spaces and on the discount factor (or the horizon, in the finite-horizon case).

3. Has low computational complexity per iteration.

Abstract

We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates as a new state transition is observed. These methods use small storage and have low computational complexity per iteration. The SPD methods find an absolute-$\epsilon$-optimal policy, with high probability, using $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2 \sigma^2}{(1-\gamma)^6 \epsilon^2}\right)$ iterations/samples for the infinite-horizon discounted-reward MDP and $\mathcal{O}\left(\frac{|\mathcal{S}|^4 |\mathcal{A}|^2 H^6 \sigma^2}{\epsilon^2}\right)$ iterations/samples for the finite-horizon MDP.
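
To get a feel for how the two bounds in the abstract scale, the following is a minimal sketch that evaluates the expressions inside the $\mathcal{O}(\cdot)$ for given problem parameters. The function names are hypothetical, and the constant factors hidden by the big-O notation are not stated in the abstract, so the returned numbers are only useful for comparing settings, not as actual sample counts.

```python
def spd_infinite_horizon_scaling(num_states, num_actions, sigma, gamma, epsilon):
    """Evaluate |S|^4 |A|^2 sigma^2 / ((1 - gamma)^6 epsilon^2).

    This is the expression inside the O(.) for the discounted case;
    the hidden constant is unknown and therefore omitted (assumption).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    numerator = num_states ** 4 * num_actions ** 2 * sigma ** 2
    return numerator / ((1.0 - gamma) ** 6 * epsilon ** 2)


def spd_finite_horizon_scaling(num_states, num_actions, horizon, sigma, epsilon):
    """Evaluate |S|^4 |A|^2 H^6 sigma^2 / epsilon^2 (finite-horizon case)."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    numerator = num_states ** 4 * num_actions ** 2 * horizon ** 6 * sigma ** 2
    return numerator / epsilon ** 2


if __name__ == "__main__":
    # Halving epsilon multiplies either expression by 4; moving gamma
    # from 0.9 to 0.95 multiplies the discounted expression by 2^6 = 64.
    base = spd_infinite_horizon_scaling(10, 4, 1.0, 0.9, 0.1)
    tighter = spd_infinite_horizon_scaling(10, 4, 1.0, 0.95, 0.1)
    print(tighter / base)
```

The two expressions share the same form once $1/(1-\gamma)$ is read as an effective horizon, which is why $H^6$ in the finite-horizon bound takes the place of $(1-\gamma)^{-6}$ in the discounted one.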

Taxonomy

Topics: Reinforcement Learning in Robotics · Neural Networks and Applications · Advanced Bandit Algorithms Research