Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting

Ziping Xu; Ambuj Tewari

arXiv:2002.02302·stat.ML·June 9, 2020·5 cites

TL;DR

This paper introduces near-optimal, oracle-efficient algorithms for reinforcement learning in non-episodic factored MDPs, providing tighter regret bounds based on a new connectivity measure called factored span.

Contribution

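The paper presents two oracle-efficient algorithms for non-episodic FMDPs, one with a Bayesian regret bound and one with a frequentist regret bound. It also introduces the factored span as a tighter connectivity measure than the diameter and adapts REGAL.C to obtain a regret bound that depends on the factored span.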

Findings

01

The oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.

02

An adaptation of REGAL.C achieves a regret bound that depends on the factored span rather than the diameter.

03

A lower bound for FMDPs is proved that depends on the factored span rather than the diameter.

Abstract

We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs). We propose two near-optimal and oracle-efficient algorithms for FMDPs. Assuming oracle access to an FMDP planner, they enjoy a Bayesian and a frequentist regret bound respectively, both of which reduce to the near-optimal bound $\widetilde{O}(DS\sqrt{AT})$ for standard non-factored MDPs. We propose a tighter connectivity measure, factored span, for FMDPs and prove a lower bound that depends on the factored span rather than the diameter $D$. In order to decrease the gap between lower and upper bounds, we propose an adaptation of the REGAL.C algorithm whose regret bound depends on the factored span. Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.
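
For readers new to FMDPs, the following is a minimal sketch of the factored structure the abstract relies on: the next value of each state factor depends only on a small subset (its scope) of the current state-action variables, so the full transition probability is a product of per-factor conditionals. The toy model, the `scopes` layout, and the rule inside `factor_prob` are hypothetical illustrations and are not taken from the paper.

```python
from itertools import product

# Hypothetical toy FMDP: 3 binary state factors and 1 binary action.
# x = (s_0, s_1, s_2, a); scopes[i] lists the indices of x that factor i depends on.
scopes = [(0, 3), (0, 1), (1, 2)]

def factor_prob(i, next_val, x):
    """P_i(next_val | x restricted to scopes[i]); an arbitrary illustrative rule."""
    parents = [x[j] for j in scopes[i]]
    # Probability that factor i takes value 1 at the next step (assumed form).
    p_one = 0.1 + 0.8 * sum(parents) / len(parents)
    return p_one if next_val == 1 else 1.0 - p_one

def transition_prob(next_state, state, action):
    """Factored transition: the product of the per-factor conditionals."""
    x = tuple(state) + (action,)
    prob = 1.0
    for i, v in enumerate(next_state):
        prob *= factor_prob(i, v, x)
    return prob

def total_mass(state, action):
    """Sum of transition probabilities over all next states (should be 1)."""
    return sum(transition_prob(ns, state, action)
               for ns in product((0, 1), repeat=3))
```

In this sketch each factor is described by a conditional distribution over two parent variables, rather than by one table over all 8 states and 2 actions. This compactness is what makes factored models attractive when the number of factors grows.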

Videos

Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting · SlidesLive

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Artificial Intelligence in Games