Sublinear Regret for Learning POMDPs
Yi Xiong, Ningyuan Chen, Xuefeng Gao, Xiang Zhou

TL;DR
This paper introduces a novel algorithm for model-based reinforcement learning in POMDPs that achieves sublinear regret, advancing the theoretical understanding of learning in partially observable environments.
Contribution
It presents the first algorithm with sublinear regret bounds for general POMDPs, combining spectral methods, belief error control, and confidence bounds.
Findings
Achieves regret bound of O(T^{2/3}√log T)
First sublinear regret algorithm for general POMDPs
Uses spectral methods and belief error control
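To make the regret claim concrete, here is a sketch of the usual average-reward regret definition. The symbols \rho^* (the oracle's optimal long-run average reward) and R_t (the reward collected at step t) are notation assumed for illustration, not quoted from the paper.

```latex
% Regret over a learning horizon T against the average-reward oracle
% (assumed notation: \rho^* = optimal long-run average reward, R_t = reward at step t)
\[
  \mathcal{R}_T \;=\; T\rho^* \;-\; \sum_{t=1}^{T} R_t .
\]
% A bound of order T^{2/3}\sqrt{\log T} is sublinear: the per-step regret vanishes.
\[
  \frac{\mathcal{R}_T}{T} \;=\; O\!\left(T^{-1/3}\sqrt{\log T}\right) \;\longrightarrow\; 0
  \quad \text{as } T \to \infty .
\]
```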
Abstract
We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, the belief error control in POMDPs and upper-confidence-bound methods for online learning. We establish a regret bound of O(T^{2/3}√log T) for the proposed learning algorithm, where T is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
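The abstract refers to beliefs, the posterior distribution over hidden states that a POMDP agent maintains. As a minimal sketch of what such a belief is (a standard Bayes filter, not the paper's algorithm), the function below performs one belief update for a fixed action. The names `belief_update`, `transition`, and `emission` are hypothetical, and the example numbers are made up for illustration.

```python
from typing import List


def belief_update(
    belief: List[float],
    transition: List[List[float]],
    emission: List[List[float]],
    obs: int,
) -> List[float]:
    """One Bayes-filter step for a fixed action (illustrative sketch).

    transition[s][s2] = P(next state s2 | current state s, chosen action)
    emission[s2][o]   = P(observation o | next state s2)
    """
    n = len(belief)
    # Prediction step: push the current belief through the transition model.
    predicted = [
        sum(belief[s] * transition[s][s2] for s in range(n)) for s2 in range(n)
    ]
    # Correction step: weight each state by the likelihood of the observation.
    unnormalized = [predicted[s2] * emission[s2][obs] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    # Normalize so the posterior sums to one.
    return [u / total for u in unnormalized]


# Hypothetical two-state example with made-up probabilities.
posterior = belief_update(
    belief=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.2, 0.8]],
    emission=[[0.8, 0.2], [0.3, 0.7]],
    obs=0,
)
```

When the transition and emission models are only estimated, the belief computed this way inherits the estimation error, which is the quantity the abstract's "belief error control" is concerned with.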
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
