Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning
Haohui Chen, Zhiyong Chen

TL;DR
This paper introduces MCRE, a framework for offline RL that balances conservatism and performance by combining the TD error with a behavior cloning term in the Bellman backup. The resulting algorithm, MCRQ, outperforms strong baselines in the reported experiments.
Contribution
The paper proposes the MCRE framework, which adds a behavior cloning term to the TD error in the Bellman backup, and the MCRQ algorithm, which integrates MCRE into an off-policy actor-critic framework.
Findings
MCRQ outperforms strong baselines on benchmark datasets.
MCRE effectively balances conservatism and performance.
The approach reduces overestimation and improves policy learning.
Abstract
Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned and behavior policies, leading to out-of-distribution (OOD) actions and overestimation. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism may hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art…
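The abstract does not state the exact form of the MCRE backup, so the following is only a minimal sketch of the general idea: a TD target whose bootstrap term is reduced by a behavior cloning penalty. The function names, the squared-distance penalty, and the weight `lam` are all illustrative assumptions, not the authors' formulation.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two action vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def regularized_target(
    reward: float,
    next_q: float,
    next_policy_action: Sequence[float],
    next_dataset_action: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.1,
    done: bool = False,
) -> float:
    """TD target with a behavior cloning penalty (assumed form, not the paper's).

    The penalty grows as the policy's next action moves away from the action
    stored in the dataset, which lowers the target for out-of-distribution
    actions. `lam` (hypothetical) controls how conservative the target is.
    """
    bc_penalty = squared_distance(next_policy_action, next_dataset_action)
    bootstrap = 0.0 if done else gamma * (next_q - lam * bc_penalty)
    return reward + bootstrap


def critic_loss(q_value: float, target: float) -> float:
    """Squared TD error between the current Q estimate and the target."""
    return (q_value - target) ** 2


# Policy action equal to the dataset action: the penalty vanishes and the
# target reduces to the ordinary TD target.
target_in = regularized_target(1.0, 10.0, [0.5, 0.5], [0.5, 0.5], gamma=0.9, lam=0.5)

# Policy action far from the dataset action: the target is pulled down.
target_ood = regularized_target(1.0, 10.0, [1.0, -1.0], [0.5, 0.5], gamma=0.9, lam=0.5)
```

In this sketch, setting `lam` to zero recovers the standard TD target, while a larger `lam` makes the evaluation more conservative; the trade-off between the two is the balance the abstract describes.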
Taxonomy
Topics: Adaptive Dynamic Programming Control · Reinforcement Learning in Robotics · Elevator Systems and Control
