Efficient, Low-Regret, Online Reinforcement Learning for Linear MDPs
Philips George John, Arnab Bhattacharyya, Silviu Maniu, Dimitrios Myrisiotis, Zhenan Wu

TL;DR
This paper introduces modified online reinforcement learning algorithms for linear MDPs that reduce space and time complexity while maintaining low regret, validated through experiments on synthetic and real data.
Contribution
It proposes and analyzes two variants of LSVI-UCB that alternate between periods of learning and not-learning, reducing space and time usage while keeping regret sublinear.
Findings
The modified algorithms achieve low space usage and running time in experiments
They maintain sublinear regret despite the modifications
They do not significantly sacrifice regret on either synthetic data or real-world benchmarks
Abstract
Reinforcement learning algorithms are usually stated without theoretical guarantees regarding their performance. Recently, Jin, Yang, Wang, and Jordan (COLT 2020) showed a polynomial-time reinforcement learning algorithm (namely, LSVI-UCB) for the setting of linear Markov decision processes, and provided theoretical guarantees regarding its running time and regret. In real-world scenarios, however, the space usage of this algorithm can be prohibitive due to a utilized linear regression step. We propose and analyze two modifications of LSVI-UCB, which alternate periods of learning and not-learning, to reduce space and time usage while maintaining sublinear regret. We show experimentally, on synthetic data and real-world benchmarks, that our algorithms achieve low space usage and running time, while not significantly sacrificing regret.
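As a rough illustration of the "alternate periods of learning and not-learning" idea, the sketch below builds a per-episode schedule and counts how many transitions would be kept for the regression step. Everything in it is an assumption made for illustration: the abstract does not say how period lengths are chosen or what is stored during not-learning periods, so the fixed lengths `learn_len` and `rest_len`, the helper names `learning_schedule` and `stored_transitions`, and the rule "store data only in learning episodes" are hypothetical and are not the authors' algorithms.

```python
from typing import List


def learning_schedule(num_episodes: int, learn_len: int, rest_len: int) -> List[bool]:
    """Return one flag per episode: True = learning period, False = not-learning.

    Fixed period lengths are an assumption made purely for illustration.
    """
    if learn_len <= 0 or rest_len < 0:
        raise ValueError("learn_len must be positive and rest_len non-negative")
    cycle = learn_len + rest_len
    # Within each cycle, the first learn_len episodes are learning episodes.
    return [(k % cycle) < learn_len for k in range(num_episodes)]


def stored_transitions(schedule: List[bool], horizon: int) -> int:
    """Count transitions kept for regression, assuming (hypothetically)
    that only learning episodes store their `horizon` transitions."""
    return sum(schedule) * horizon


# Alternate 2 learning episodes with 3 not-learning episodes.
schedule = learning_schedule(num_episodes=10, learn_len=2, rest_len=3)
# Baseline for comparison: every episode is a learning episode.
always_learning = learning_schedule(num_episodes=10, learn_len=1, rest_len=0)

print(stored_transitions(schedule, horizon=5))
print(stored_transitions(always_learning, horizon=5))
```

Under these assumptions, the alternating schedule stores fewer transitions than the always-learning baseline, which is the kind of space saving the abstract describes; the effect on regret depends on the actual schedule and is not modeled here.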
Taxonomy
Topics: Reinforcement Learning in Robotics · Smart Grid Energy Management · Adaptive Dynamic Programming Control
Methods: Linear Regression
