Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model

Jing Zhang; Linjiajie Fang; Kexin Shi; Wenjia Wang; Bing-Yi Jing

arXiv:2410.20312 · cs.LG · January 14, 2025

TL;DR

This paper introduces Q-Distribution Guided Q-Learning (QDQ), an offline reinforcement learning method that penalizes the Q-values of out-of-distribution actions with high uncertainty. The uncertainty is estimated from a Q-value distribution learned with a consistency model, which yields better Q-value estimates.

Contribution

The paper proposes a novel uncertainty-aware Q-learning approach that leverages a consistency model to improve Q-value estimates and reduce overestimation in offline RL.
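A toy setting, not taken from the paper, shows why overestimation happens. Suppose every action has a true Q-value of 0 but each estimate carries zero-mean noise. Taking the maximum over actions, as a greedy policy or a Bellman target does, is then biased upward. The sketch below assumes Gaussian noise and uses a hypothetical helper, `max_target_bias`.

```python
import random
import statistics


def max_target_bias(n_actions, noise_std, trials, seed=0):
    """Average of max_a Qhat(a) when every true Q(a) is 0 (toy model)."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(trials):
        # Unbiased but noisy estimates of a true Q-value of 0 for each action
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        # Greedy selection picks whichever estimate happens to be largest
        maxima.append(max(estimates))
    return statistics.fmean(maxima)


bias = max_target_bias(n_actions=10, noise_std=1.0, trials=2000)
print(f"average max estimate: {bias:.2f} (true max is 0.0)")
```

The bias grows with the number of actions and the noise level. In offline RL the noise is largest for OOD actions that the dataset never covers, which is why penalizing uncertain actions counteracts it.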

Findings

01. QDQ achieves strong performance on the D4RL benchmarks.
02. The method provides theoretical guarantees for the accuracy of the Q-value distribution learned by the consistency model.
03. QDQ outperforms existing offline RL methods across multiple tasks.

Abstract

"Distribution shift" is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy's knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values for these OOD actions can be easily overestimated. As a result, the learning policy is biased by using incorrect Q-value estimates. One common approach to avoid Q-value overestimation is to make a pessimistic adjustment. Our key idea is to penalize the Q-values of OOD actions associated with high uncertainty. In this work, we propose Q-Distribution Guided Q-Learning (QDQ), which applies a pessimistic adjustment to Q-values in OOD regions based on uncertainty estimation. This uncertainty measure relies on the conditional Q-value distribution, learned through a high-fidelity and efficient consistency model. Additionally, to prevent overly conservative…
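The abstract describes the core mechanism only at a high level. The sketch below is a minimal illustration that makes two assumptions. First, uncertainty is taken to be the standard deviation of Q-value samples drawn from the learned conditional distribution. Second, the pessimistic adjustment is taken to be a mean-minus-beta-times-std penalty, applied only when uncertainty exceeds a threshold.

Both rules, the Gaussian sampler standing in for the consistency model, and the names `sample_q_values`, `penalized_q`, `threshold`, and `beta` are hypothetical. For the authors' actual implementation, see the official JAX repository (evalarzj/qdq).

```python
import random
import statistics


def sample_q_values(mean, spread, n, rng):
    """Hypothetical stand-in for the learned consistency model.

    Returns n Gaussian samples; a real model would instead generate
    Q-value samples conditioned on the state-action pair.
    """
    return [rng.gauss(mean, spread) for _ in range(n)]


def penalized_q(q_samples, threshold, beta):
    """Uncertainty-penalized Q-value (illustrative rule, not the paper's).

    Uncertainty is the sample standard deviation. Actions whose uncertainty
    exceeds `threshold` are treated as likely OOD and get the pessimistic
    estimate mean - beta * std; all others keep the sample mean.
    """
    mu = statistics.fmean(q_samples)
    sigma = statistics.stdev(q_samples)  # requires at least two samples
    if sigma > threshold:
        return mu - beta * sigma
    return mu


rng = random.Random(0)
in_dist = sample_q_values(mean=5.0, spread=0.1, n=200, rng=rng)  # low spread
ood = sample_q_values(mean=5.0, spread=2.0, n=200, rng=rng)      # high spread
print(penalized_q(in_dist, threshold=0.5, beta=1.0))  # unpenalized sample mean
print(penalized_q(ood, threshold=0.5, beta=1.0))      # mean minus beta * std
```

Applying the penalty only above a threshold leaves low-uncertainty actions untouched, so the adjustment targets likely OOD actions instead of lowering every Q-value.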

Code & Models

Repositories

evalarzj/qdq
JAX · Official

Taxonomy

Topics: Fault Detection and Control Systems · Neural Networks and Applications · Elevator Systems and Control

Methods: Q-Learning