Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning

Haohui Chen; Zhiyong Chen

arXiv:2508.05960 · cs.LG · August 11, 2025

TL;DR

This paper introduces MCRE, a framework that balances conservatism and performance in offline RL by combining the temporal difference (TD) error with a behavior cloning term in the Bellman backup. The resulting algorithm, MCRQ, outperforms strong baselines on benchmark datasets.

Contribution

The paper proposes the MCRE framework and, building on it, the MCRQ algorithm, which integrates MCRE into an off-policy actor-critic framework to balance conservatism and performance in offline RL.

Findings

01. MCRQ outperforms strong baselines on benchmark datasets.

02. MCRE effectively balances conservatism and performance.

03. The approach reduces overestimation and improves policy learning.

Abstract

Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned and behavior policies, leading to out-of-distribution (OOD) actions and overestimation. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism may hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art…
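
The abstract does not spell out the exact form of the MCRE backup, so the following is only an illustrative sketch of the general idea it describes: a TD target that is made more conservative by a behavior cloning term. Everything specific here is an assumption and not taken from the paper: the squared-distance penalty, the hypothetical coefficient `bc_weight`, the function name `regularized_target`, and the availability of a dataset next action `data_next_action`.

```python
import numpy as np

def regularized_target(reward, next_q, policy_next_action, data_next_action,
                       done, gamma=0.99, bc_weight=0.1):
    """Hypothetical conservative TD target (an assumption, not the paper's formula).

    The usual bootstrap value next_q is reduced by a behavior cloning penalty
    that grows as the policy's next action moves away from the dataset action.
    """
    # Squared distance between policy action and dataset action, per sample.
    bc_penalty = np.sum((policy_next_action - data_next_action) ** 2, axis=-1)
    # Standard TD target, with the penalty subtracted inside the bootstrap term.
    return reward + gamma * (1.0 - done) * (next_q - bc_weight * bc_penalty)

# Toy batch of two transitions (made-up numbers for illustration only).
reward = np.array([1.0, 0.0])
next_q = np.array([2.0, 3.0])
policy_next_action = np.array([[0.5, 0.0], [1.0, 1.0]])
data_next_action = np.array([[0.5, 0.0], [0.0, 1.0]])
done = np.array([0.0, 0.0])

targets = regularized_target(reward, next_q, policy_next_action,
                             data_next_action, done, gamma=0.5, bc_weight=1.0)
plain_td = reward + 0.5 * (1.0 - done) * next_q
```

In this toy batch the first transition's policy action matches the dataset action, so its target equals the plain TD target; the second deviates from the data and its target is lowered. A small `bc_weight` keeps the penalty mild, which is the trade-off between conservatism and performance that the abstract refers to.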

Taxonomy

Topics: Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics · Elevator Systems and Control