A Covering Framework for Offline POMDPs Learning using Belief Space Metric

Youheng Zhu; Yiping Lu

arXiv:2603.03191 · stat.ML · March 4, 2026

Open Access

TL;DR

This paper presents a new covering analysis framework for offline POMDPs that leverages belief space metrics to provide tighter error bounds and reduce sample complexity in off-policy evaluation.

Contribution

It introduces a belief space metric-based covering analysis that relaxes traditional coverage assumptions and applies broadly to various OPE algorithms in POMDPs.

Findings

01. Improved sample efficiency demonstrated in case studies.

02. Tighter error bounds using belief space metrics.

03. Applicable to multiple OPE algorithms.

Abstract

In off-policy evaluation (OPE) for partially observable Markov decision processes (POMDPs), an agent must infer hidden states from past observations, which exacerbates both the curse of horizon and the curse of memory in existing OPE methods. This paper introduces a novel covering analysis framework that exploits the intrinsic metric structure of the belief space (distributions over latent states) to relax traditional coverage assumptions. By assuming value-relevant functions are Lipschitz continuous in the belief space, we derive error bounds that mitigate exponential blow-ups in horizon and memory length. Our unified analysis technique applies to a broad class of OPE algorithms, yielding concrete error bounds and coverage requirements expressed in terms of belief space metrics rather than raw history coverage. We illustrate the improved sample efficiency of this framework via case…
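
The intuition behind covering the belief space instead of raw histories can be illustrated with a minimal sketch. This is not the paper's algorithm: the two-state hidden Markov chain, the matrices `T` and `O`, the L1 metric, and the helper names `belief_update`, `belief_of`, and `greedy_cover` are all assumptions made for illustration, and actions are omitted for brevity. The sketch maps every observation history of length `H` to its belief and then builds a greedy eps-cover of the resulting beliefs.

```python
import itertools
import numpy as np

# Hypothetical 2-state, 2-observation hidden Markov chain (illustrative values only).
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # T[s, s2] = P(next state s2 | state s)
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])   # O[s2, o] = P(observation o | state s2)
b0 = np.array([0.5, 0.5])    # initial belief over latent states


def belief_update(belief, obs):
    """One Bayes-filter step: predict with T, then reweight by the likelihood of obs."""
    predicted = belief @ T
    unnorm = predicted * O[:, obs]
    return unnorm / unnorm.sum()


def belief_of(history):
    """Belief reached after observing the whole history, starting from b0."""
    b = b0
    for obs in history:
        b = belief_update(b, obs)
    return b


def greedy_cover(points, eps):
    """Greedy eps-cover under L1 distance: each point ends within eps of some center."""
    centers = []
    for p in points:
        if not any(np.abs(p - c).sum() <= eps for c in centers):
            centers.append(p)
    return centers


H, EPS = 8, 0.1
histories = list(itertools.product(range(2), repeat=H))  # all 2**H observation histories
beliefs = [belief_of(h) for h in histories]
centers = greedy_cover(beliefs, EPS)
print(f"{len(histories)} histories, {len(centers)} belief-space centers")
```

The number of histories grows as `2**H`, while the number of centers is limited by the geometry of the belief simplex. If a function is Lipschitz with constant `L` in this metric, its value at any belief differs from its value at the nearest center by at most `L * EPS`, which is the kind of approximation a covering argument relies on.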

Taxonomy

Topics: Reinforcement Learning in Robotics · Age of Information Optimization · Advanced Bandit Algorithms Research