Bayesian Conservative Policy Optimization (BCPO): A Novel Uncertainty-Calibrated Offline Reinforcement Learning with Credible Lower Bounds

Debashis Chatterjee

arXiv:2603.12284 · stat.ME · March 16, 2026



TL;DR

This paper introduces BCPO, a Bayesian offline RL method that turns epistemic uncertainty into credible lower bounds on action values and regularizes policy updates toward the behavior distribution with a KL penalty, targeting robustness to distribution shift and value overestimation.

Contribution

The paper proposes BCPO, a novel Bayesian framework that converts epistemic uncertainty into credible lower bounds for safe offline policy optimization.

Findings

01 BCPO provides high-probability lower bounds on true value functions.

02 Empirical results on CartPole show improved stability and calibration.

03 Theoretical analysis guarantees conservative fixed points and policy improvement.

Abstract

Offline reinforcement learning (RL) aims to learn decision policies from a fixed batch of logged transitions, without additional environment interaction. Despite remarkable empirical progress, offline RL remains fragile under distribution shifts: value-based methods can overestimate the value of unseen actions, yielding policies that exploit model errors rather than genuine long-term rewards. We propose Bayesian Conservative Policy Optimization (BCPO), a unified framework that converts epistemic uncertainty into provably conservative policy improvement. BCPO maintains a hierarchical Bayesian posterior over environment/value models, constructs a credible lower bound (LCB) on action values, and performs policy updates under explicit KL regularization toward the behavior distribution. This yields an uncertainty-calibrated analogue of conservative policy iteration in…
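
The two steps named in the abstract, a credible lower bound on action values followed by a KL-regularized policy update toward the behavior distribution, can be illustrated with a minimal sketch. This is not the paper's implementation: the hierarchical posterior is replaced by a plain array of posterior draws, the lower bound is assumed to take the mean-minus-`beta`-times-standard-deviation form, and the names `credible_lower_bound`, `kl_regularized_policy`, `beta`, and `temperature` are hypothetical. The only fact relied on is the standard result that, for a discrete action set, maximizing expected value minus a KL penalty gives a policy proportional to the behavior probability times an exponential of the value.

```python
import numpy as np


def credible_lower_bound(q_samples, beta=1.0):
    """Lower bound on Q(s, a) from posterior draws.

    q_samples has shape (n_draws, n_actions). The mean-minus-beta-std form
    and the name `beta` are assumptions made for illustration only.
    """
    q_samples = np.asarray(q_samples, dtype=float)
    return q_samples.mean(axis=0) - beta * q_samples.std(axis=0)


def kl_regularized_policy(q_lcb, behavior_probs, temperature=1.0):
    """Maximize E_pi[q_lcb] - temperature * KL(pi || behavior) over pi.

    For a discrete action set the maximizer is
    pi(a) proportional to behavior(a) * exp(q_lcb(a) / temperature).
    """
    q_lcb = np.asarray(q_lcb, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    # log(0) = -inf, so actions the behavior policy never takes get zero mass.
    with np.errstate(divide="ignore"):
        logits = np.log(behavior_probs) + q_lcb / temperature
    logits -= logits.max()  # stabilize the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()


# Toy state with two actions (made-up numbers, not results from the paper).
# Action 0: every posterior draw agrees. Action 1: higher mean, wide spread.
q_samples = np.array([[1.0, 3.0], [1.0, 0.0], [1.0, 3.0], [1.0, 0.0]])
behavior = np.array([0.5, 0.5])

posterior_mean = q_samples.mean(axis=0)          # [1.0, 1.5]
q_lcb = credible_lower_bound(q_samples)          # [1.0, 0.0]
policy = kl_regularized_policy(q_lcb, behavior)  # favors action 0
```

In this toy case a greedy policy on the posterior mean would pick action 1, while the lower bound ranks action 0 higher because its value is certain. The KL term keeps the update soft: the resulting policy places roughly 0.73 of its mass on action 0 instead of committing fully, and a larger `temperature` would pull it closer to the behavior distribution.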


Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning