Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model
Jing Zhang, Linjiajie Fang, Kexin Shi, Wenjia Wang, Bing-Yi Jing

TL;DR
This paper introduces Q-Distribution Guided Q-Learning (QDQ), a method that penalizes uncertain out-of-distribution actions in offline reinforcement learning using a consistency model for better Q-value estimation.
Contribution
The paper proposes a novel uncertainty-aware Q-learning approach that leverages a consistency model to improve Q-value estimates and reduce overestimation in offline RL.
Findings
QDQ achieves strong performance on D4RL benchmarks.
The method provides theoretical guarantees for Q-value distribution accuracy.
QDQ outperforms existing offline RL methods across multiple tasks.
Abstract
"Distribution shift" is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy's knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values for these OOD actions can be easily overestimated. As a result, the learning policy is biased by using incorrect Q-value estimates. One common approach to avoid Q-value overestimation is to make a pessimistic adjustment. Our key idea is to penalize the Q-values of OOD actions associated with high uncertainty. In this work, we propose Q-Distribution Guided Q-Learning (QDQ), which applies a pessimistic adjustment to Q-values in OOD regions based on uncertainty estimation. This uncertainty measure relies on the conditional Q-value distribution, learned through a high-fidelity and efficient consistency model. Additionally, to prevent overly conservative…
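The core idea in the abstract, penalizing a Q-value only when its estimated uncertainty is high, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `penalized_q`, the use of a standard deviation over samples as the uncertainty measure, and the `threshold` and `beta` parameters are all assumptions made for illustration. The samples stand in for draws from a learned conditional Q-value distribution, which the paper obtains from a consistency model.

```python
import statistics


def penalized_q(q_samples, q_estimate, threshold, beta=1.0):
    """Return a pessimistically adjusted Q-value (illustrative only).

    q_samples:  draws of Q(s, a) from a learned conditional Q-value
                distribution (hypothetical input; stands in for the
                consistency-model sampler described in the abstract)
    q_estimate: the critic's point estimate of Q(s, a)
    threshold:  uncertainty level above which the action is treated
                as out-of-distribution (hypothetical parameter)
    beta:       penalty strength (hypothetical parameter)
    """
    # Assumption: spread of the sampled Q-values is the uncertainty measure.
    uncertainty = statistics.pstdev(q_samples)

    # Low uncertainty: treat as in-distribution and leave the estimate alone.
    if uncertainty <= threshold:
        return q_estimate

    # High uncertainty: subtract a penalty proportional to the uncertainty.
    return q_estimate - beta * uncertainty


# Tightly clustered samples: the estimate is returned unchanged.
in_dist = penalized_q([10.0, 10.1, 9.9], q_estimate=10.0, threshold=0.5)

# Widely spread samples: the estimate is pushed down.
ood = penalized_q([5.0, 15.0, 10.0], q_estimate=10.0, threshold=0.5)
```

Applying the penalty only above a threshold, rather than everywhere, reflects the abstract's stated concern with avoiding overly conservative estimates; how QDQ actually sets that boundary is not recoverable from the truncated text.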
Taxonomy
Topics: Fault Detection and Control Systems · Neural Networks and Applications · Elevator Systems and Control
Methods: Q-Learning
