Sublinear Regret for Learning POMDPs

Yi Xiong; Ningyuan Chen; Xuefeng Gao; Xiang Zhou

arXiv:2107.03635 · cs.LG · July 19, 2022 · 1 cites

TL;DR

This paper proposes a model-based learning algorithm for undiscounted, average-reward POMDPs and proves that its regret grows sublinearly in the learning horizon $T$.

Contribution

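It presents what the authors describe as the first algorithm with a sublinear regret bound for general POMDPs, measured against the optimal average-reward policy of the known model. The algorithm combines spectral method-of-moments estimation for hidden Markov models, belief error control, and upper-confidence-bound methods.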

Findings

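01 Achieves a regret bound of $O(T^{2/3}\sqrt{\log T})$, where $T$ is the learning horizon

02 To the authors' knowledge, the first algorithm with sublinear regret for general POMDPs

03 Builds on spectral method-of-moments estimation, belief error control, and upper-confidence bounds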
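
To see why the bound in the first finding counts as sublinear, it helps to divide it by the horizon. Only the stated rate is used here and constants are ignored, so this is a one-line sanity check rather than part of the paper's proof:

$$
\frac{T^{2/3}\sqrt{\log T}}{T} \;=\; \frac{\sqrt{\log T}}{T^{1/3}} \;\longrightarrow\; 0 \quad \text{as } T \to \infty .
$$

Under this bound the regret per period vanishes as the horizon grows, which is what sublinear regret means.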

Abstract

We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, belief error control in POMDPs, and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
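
The "belief" mentioned in the abstract is the posterior distribution over hidden states given past actions and observations. As a rough illustration, the sketch below shows the standard Bayes-filter belief update for a finite POMDP. It is not the paper's algorithm. The function name `belief_update`, the array layouts, the assumption that observations depend only on the next state, and the two-state model are all hypothetical choices made for this example.

```python
import numpy as np

def belief_update(belief, transition, observation, action, obs):
    """One step of the standard Bayes filter for a finite POMDP.

    belief:      shape (S,), current distribution over hidden states
    transition:  shape (A, S, S), transition[a, s, s2] = P(s2 | s, a)
    observation: shape (S, O), observation[s2, o] = P(o | s2)  (assumed layout)
    """
    # Predict: push the belief through the transition kernel of the chosen action.
    predicted = belief @ transition[action]
    # Correct: reweight each next state by the likelihood of the observation.
    unnormalized = predicted * observation[:, obs]
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total

# Hypothetical model with 2 states, 2 actions, 2 observations (made-up numbers).
transition = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.5, 0.5]],   # action 1
])
observation = np.array([
    [0.8, 0.2],                 # state 0 mostly emits observation 0
    [0.3, 0.7],                 # state 1 mostly emits observation 1
])
belief = np.array([0.5, 0.5])

new_belief = belief_update(belief, transition, observation, action=0, obs=1)
print(new_belief)
```

When the transition and observation matrices are estimated from data rather than known, the belief computed this way differs from the true one, and the belief error control named in the abstract addresses that gap.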

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms