Bellman Calibration for $V$-Learning in Offline Reinforcement Learning
Lars van der Laan, Nathan Kallus

TL;DR
This paper introduces Bellman calibration, a weak reliability criterion for long-horizon value predictions in offline reinforcement learning, and uses it to diagnose and recalibrate value estimates without requiring Bellman completeness.
Contribution
It proposes a weak calibration criterion, a practical off-policy diagnostic for calibration error, and a model-agnostic post-hoc recalibration procedure with finite-sample guarantees.
Findings
Bellman calibration error can be estimated from off-policy data.
Iterated Bellman Calibration improves value predictions without Bellman completeness.
Finite-sample guarantees control calibration error at nonparametric rates.
Abstract
Reliable long-horizon value prediction is difficult in offline reinforcement learning because fitted value methods combine bootstrapping, function approximation, and distribution shift, while standard guarantees often require Bellman completeness or realizability. We introduce Bellman calibration, a weak reliability criterion requiring that states assigned similar predicted values have average Bellman targets that agree with those predictions. This criterion yields a scalar calibration error for diagnosing systematic numerical miscalibration, which we estimate from off-policy data using doubly robust Bellman target estimates. We then propose Iterated Bellman Calibration, a model-agnostic post-hoc procedure that recalibrates any learned value predictor by fitting a one-dimensional map of its original prediction, with histogram and isotonic variants. We prove finite-sample guarantees…
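The histogram variant described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the quantile binning, and the use of plain (optionally importance-weighted) Bellman targets r + gamma * g(v(s')) in place of the paper's doubly robust target estimates are all assumptions made here for illustration. The sketch bins states by their original prediction v(s), then repeatedly replaces each bin's value with the average Bellman target of the transitions that start in that bin.

```python
import numpy as np


def bin_index(edges, v):
    """Map predictions to histogram bins defined by interior edges."""
    return np.searchsorted(edges, np.asarray(v, dtype=float), side="right")


def binned_calibration_error(pred_s, pred_next, rewards, bins, gamma, weights=None):
    """Root-mean-square gap between per-bin average Bellman target and prediction.

    Hypothetical diagnostic: uses plain weighted targets, not doubly robust ones.
    """
    pred_s = np.asarray(pred_s, dtype=float)
    pred_next = np.asarray(pred_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    w = np.ones_like(pred_s) if weights is None else np.asarray(weights, dtype=float)
    resid = rewards + gamma * pred_next - pred_s
    n_bins = int(bins.max()) + 1
    w_sum = np.bincount(bins, weights=w, minlength=n_bins)
    mean_resid = np.bincount(bins, weights=w * resid, minlength=n_bins) / np.maximum(w_sum, 1e-12)
    return float(np.sqrt(np.sum(w_sum * mean_resid**2) / np.sum(w_sum)))


def iterated_histogram_calibration(v_s, v_next, rewards, gamma=0.9,
                                   n_bins=10, n_iters=50, weights=None):
    """Fit a piecewise-constant map g of the original prediction by iterating
    g[bin] <- weighted mean of r + gamma * g[bin of next state]."""
    v_s = np.asarray(v_s, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    w = np.ones_like(v_s) if weights is None else np.asarray(weights, dtype=float)

    # Interior quantile edges of the original predictions (assumed binning rule).
    edges = np.quantile(v_s, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bin_s = bin_index(edges, v_s)
    bin_next = bin_index(edges, v_next)

    w_sum = np.bincount(bin_s, weights=w, minlength=n_bins)
    safe = np.maximum(w_sum, 1e-12)
    # Start from the per-bin average of the original predictions.
    g = np.bincount(bin_s, weights=w * v_s, minlength=n_bins) / safe
    for _ in range(n_iters):
        target = rewards + gamma * g[bin_next]
        new_g = np.bincount(bin_s, weights=w * target, minlength=n_bins) / safe
        # Empty bins keep their previous value.
        g = np.where(w_sum > 0, new_g, g)
    return edges, g


def apply_calibration(edges, g, v):
    """Recalibrated prediction g(v) for new original predictions v."""
    return g[bin_index(edges, v)]
```

As a toy usage example, take two absorbing states with rewards 0 and 1, discount 0.5, and original predictions 0 and 10. Calling `iterated_histogram_calibration([0, 0, 10, 10], [0, 0, 10, 10], [0, 0, 1, 1], gamma=0.5, n_bins=2)` drives the second bin toward the fixed point of g = 1 + 0.5 g, while the first bin stays at 0. The binned calibration error of the recalibrated predictions is then close to zero, whereas the original predictions leave a clear gap in the second bin.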
