Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting
Alessio Russo, Alberto Maria Metelli, Marcello Restelli

TL;DR
This paper introduces an efficient method for learning average-reward POMDPs with known observation models, using spectral estimation and an exploration strategy that guarantees low regret and scales well with problem size.
Contribution
It proposes the OAS spectral estimation technique and the OAS-UCRL algorithm, providing the first regret guarantees for POMDPs with known observation models in the average-reward setting.
Findings
Regret bound of order O(√T log T) for the proposed algorithm.
Efficient scaling with state, action, and observation space dimensions.
Numerical simulations validate the approach against baselines.
Abstract
Dealing with Partially Observable Markov Decision Processes (POMDPs) is a notably challenging task. We consider an average-reward infinite-horizon POMDP setting with an unknown transition model, where we assume knowledge of the observation model. Under this assumption, we propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy. Then, we propose the OAS-UCRL algorithm, which implicitly balances the exploration-exploitation trade-off following the optimism in the face of uncertainty principle. The algorithm runs through episodes of increasing length. For each episode, the optimal belief-based policy of the estimated POMDP interacts with the environment and collects samples that will be used in the next episode by the OAS estimation procedure to compute a new estimate of the POMDP…
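The abstract relies on belief-based policies, which act on a posterior distribution over hidden states rather than on the state itself. As a rough illustration, the standard Bayes-filter belief update can be sketched as below. This is a generic textbook construction, not code from the paper: the function name `belief_update`, the array layouts, and the assumption that observations depend only on the arrival state are all illustrative choices. In the paper's setting the observation model would be known, while the transition model would be replaced by the current estimate.

```python
def belief_update(belief, transition, observation, action, obs):
    """One step of the standard POMDP belief (Bayes filter) update.

    Assumed (hypothetical) layouts:
      belief[s]                 = P(state = s | history)
      transition[a][s][s_next]  = P(s_next | s, a)   (estimated in practice)
      observation[s_next][o]    = P(o | s_next)      (assumed known)
    """
    n_states = len(belief)

    # Prediction step: push the belief through the transition model.
    predicted = [
        sum(belief[s] * transition[action][s][s_next] for s in range(n_states))
        for s_next in range(n_states)
    ]

    # Correction step: reweight by the likelihood of the received observation.
    unnormalized = [
        observation[s_next][obs] * predicted[s_next]
        for s_next in range(n_states)
    ]

    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")

    # Normalize so the new belief is a probability distribution.
    return [u / total for u in unnormalized]


# Toy example with 2 states, 1 action, 2 observations (made-up numbers).
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[0.8, 0.2], [0.3, 0.7]]
new_belief = belief_update([0.5, 0.5], T, O, action=0, obs=0)
```

In the toy example, seeing observation 0 (more likely in state 0) shifts the belief toward state 0. A belief-based policy would then choose its next action as a function of `new_belief`.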
Taxonomy
Topics: Multi-Criteria Decision Making · Fuzzy Systems and Optimization
