Near-optimal Reinforcement Learning in Factored MDPs

Ian Osband; Benjamin Van Roy

arXiv:1403.3741·stat.ML·November 4, 2014·54 cites

TL;DR

This paper shows that when a Markov decision process is known to be factored, reinforcement learning algorithms can achieve regret bounds that scale polynomially in the number of parameters encoding the factored MDP, rather than in the sizes of the state and action spaces, which may be exponentially larger.

Contribution

The paper provides two algorithms, posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored), that satisfy near-optimal regret bounds in factored MDPs by exploiting the factored structure to cope with very large state and action spaces.

Findings

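01 Regret scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than $S$ or $A$.

02 Two algorithms, PSRL and UCRL-Factored, are shown to satisfy near-optimal regret bounds in this setting.

03 Any algorithm that applies to all MDPs suffers regret that grows with $S$ and $A$ on some MDP, so its learning time becomes unacceptable when these are enormous; algorithms that exploit known factored structure avoid this dependence.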
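
To give a feel for how much smaller a factored encoding can be, here is a minimal sketch that counts free transition probabilities for a flat MDP and for a factored one. The function names, the ring-shaped dependency structure, and the sizes are illustrative assumptions rather than anything taken from the paper, and reward parameters are left out.

```python
import math

def flat_transition_params(num_states, num_actions):
    # One categorical distribution over next states per (state, action) pair;
    # each has num_states - 1 free probabilities.
    return num_states * num_actions * (num_states - 1)

def factored_transition_params(factor_sizes, input_sizes, parent_scopes):
    # factor_sizes[i]: number of values next-state factor i can take.
    # input_sizes[j]: number of values of state-action variable j.
    # parent_scopes[i]: indices j of the variables that factor i depends on.
    total = 0
    for size, scope in zip(factor_sizes, parent_scopes):
        parent_configs = math.prod(input_sizes[j] for j in scope)
        total += parent_configs * (size - 1)
    return total

# Hypothetical system: m binary state variables on a ring plus one binary action.
m = 20
factor_sizes = [2] * m
input_sizes = [2] * (m + 1)  # index m is the action variable
# Assumed scope: each variable depends on itself, its left neighbour, and the action.
parent_scopes = [[i, (i - 1) % m, m] for i in range(m)]

flat = flat_transition_params(2 ** m, 2)
factored = factored_transition_params(factor_sizes, input_sizes, parent_scopes)
print(f"flat: {flat:,} parameters; factored: {factored} parameters")
```

In this assumed example the flat count grows like $S^2 A$ with $S = 2^{20}$, while the factored count grows linearly in the number of state variables as long as each scope stays small.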

Abstract

Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $\Omega(\sqrt{S A T})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = \Omega(S A)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be so enormous that this learning time is unacceptable. We establish that, if the system is known to be a factored MDP, it is possible to achieve regret that scales polynomially in the number of parameters encoding the factored MDP, which may be exponentially smaller than $S$ or $A$. We provide two algorithms that satisfy near-optimal regret bounds in this context: posterior sampling reinforcement learning (PSRL) and an upper confidence bound algorithm (UCRL-Factored).
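
The step from the regret lower bound to the learning-time claim can be spelled out. The following is a sketch only: the constant $c > 0$ and the accuracy target $\epsilon$ are symbols introduced here for illustration and do not appear in the abstract.

```latex
% Lower bound on cumulative regret over T steps (constant c > 0 assumed):
\mathrm{Regret}(T) \;\ge\; c\,\sqrt{S A T}
% Dividing by T gives a lower bound on the average per-step regret:
\frac{\mathrm{Regret}(T)}{T} \;\ge\; c\,\sqrt{\frac{S A}{T}}
% Requiring the average per-step regret to be at most \epsilon forces:
c\,\sqrt{\frac{S A}{T}} \;\le\; \epsilon
\quad\Longrightarrow\quad
T \;\ge\; \frac{c^{2}\, S A}{\epsilon^{2}}
% For any fixed \epsilon this is T = \Omega(S A).
```

So for any fixed accuracy level, the time needed before the average regret can be small grows at least linearly in $S A$.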

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Smart Grid Energy Management