Mildly Conservative Q-Learning for Offline Reinforcement Learning
Jiafei Lyu, Xiaoteng Ma, Xiu Li, Zongqing Lu

TL;DR
This paper introduces Mildly Conservative Q-learning (MCQ), an offline RL method that balances conservatism with generalization, improving performance on D4RL benchmark tasks and transfer from offline to online settings.
Contribution
MCQ actively trains out-of-distribution (OOD) actions by assigning them pseudo Q values, a theoretically justified approach that improves offline RL performance without excessive pessimism.
Findings
MCQ outperforms prior methods on D4RL benchmarks.
MCQ demonstrates superior transfer from offline to online settings.
MCQ maintains conservative estimates without overestimating OOD actions.
Abstract
Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment. The distribution shift between the learned policy and the behavior policy makes it necessary for the value function to stay conservative such that out-of-distribution (OOD) actions will not be severely overestimated. However, existing approaches, penalizing the unseen actions or regularizing with the behavior policy, are too pessimistic, which suppresses the generalization of the value function and hinders the performance improvement. This paper explores mild but enough conservatism for offline learning while not harming generalization. We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values. We theoretically show that MCQ induces a policy that behaves at least as well…
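The abstract says OOD actions are "actively trained by assigning them proper pseudo Q values" but does not say how those values are chosen. The sketch below shows one plausible construction, assumed here purely for illustration: the pseudo target for an OOD action is the largest Q value among actions sampled from a model of the behavior policy, and the critic loss mixes the usual Bellman error with a regression of OOD Q values toward that target. The names `pseudo_q_targets`, `mildly_conservative_loss`, `sample_in_dist_actions`, `num_samples`, and the weight `lam` are hypothetical and do not come from the text above.

```python
import numpy as np


def pseudo_q_targets(q_fn, states, sample_in_dist_actions, num_samples=10):
    """Assumed construction: pseudo target = max Q over sampled in-distribution actions.

    q_fn(states, actions) -> array of shape (batch,)
    sample_in_dist_actions(states, n) -> array of shape (batch, n, action_dim)
    """
    candidates = sample_in_dist_actions(states, num_samples)
    batch, n, act_dim = candidates.shape
    # Repeat each state n times so it can be paired with every candidate action.
    rep_states = np.repeat(states[:, None, :], n, axis=1)
    q_vals = q_fn(
        rep_states.reshape(batch * n, -1),
        candidates.reshape(batch * n, act_dim),
    ).reshape(batch, n)
    return q_vals.max(axis=1)


def mildly_conservative_loss(q_data, td_target, q_ood, pseudo_target, lam=0.9):
    """Weighted sum of the Bellman error on dataset actions and the
    regression error of OOD Q values toward their pseudo targets.

    lam is a hypothetical trade-off weight in [0, 1].
    """
    bellman_err = np.mean((q_data - td_target) ** 2)
    ood_err = np.mean((q_ood - pseudo_target) ** 2)
    return lam * bellman_err + (1.0 - lam) * ood_err
```

Under this assumption, an OOD action is never assigned a value above the best sampled in-distribution action, which limits overestimation without pushing OOD values arbitrarily low. With `lam` close to 1 the loss reduces to ordinary Q-learning on the dataset.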
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning
Methods: Q-Learning
