Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

Yang Yue; Rui Lu; Bingyi Kang; Shiji Song; Gao Huang

arXiv:2310.04411·cs.LG·November 8, 2023·1 cites

TL;DR

This paper investigates the root cause of Q-value divergence in offline RL, introduces an NTK-based metric to predict divergence, and proposes using LayerNorm to improve stability and performance.

Contribution

It identifies self-excitation as the main cause of divergence, develops a predictive NTK-based measure, and demonstrates LayerNorm as an effective architectural solution.

Findings

1. The SEEM metric predicts divergence early in training.

2. LayerNorm prevents divergence without introducing detrimental bias.

3. The method achieves state-of-the-art results on challenging offline RL tasks.
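
The LayerNorm finding can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer sizes, the placement of LayerNorm before the ReLU, the omission of learned scale and shift parameters, and the names `layer_norm`, `init_q_network`, and `q_forward` are all assumptions made for illustration. It only shows one plausible intuition, namely that normalized hidden features stay bounded however large the input is.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_q_network(state_dim, action_dim, hidden_dim=64, seed=0):
    """Random weights and zero biases for a two-hidden-layer Q-network (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    sizes = [state_dim + action_dim, hidden_dim, hidden_dim, 1]
    return [
        (rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def q_forward(params, states, actions, use_layer_norm=True):
    """Return Q(s, a) and the last hidden features for a batch of state-action pairs."""
    h = np.concatenate([states, actions], axis=-1)
    for w, b in params[:-1]:
        h = h @ w + b
        if use_layer_norm:
            h = layer_norm(h)  # placed before the nonlinearity (an assumption)
        h = np.maximum(h, 0.0)  # ReLU
    w_out, b_out = params[-1]
    return (h @ w_out + b_out).squeeze(-1), h


# Compare in-range inputs with inputs scaled far outside that range.
rng = np.random.default_rng(0)
states = rng.normal(size=(5, 3))
actions = rng.normal(size=(5, 1))
params = init_q_network(state_dim=3, action_dim=1)

for scale in (1.0, 1000.0):
    q_plain, _ = q_forward(params, scale * states, scale * actions, use_layer_norm=False)
    q_norm, _ = q_forward(params, scale * states, scale * actions, use_layer_norm=True)
    print(f"scale={scale}: max|Q| plain={np.abs(q_plain).max():.3g}, "
          f"LayerNorm={np.abs(q_norm).max():.3g}")
```

In this toy setup the un-normalized network's output grows in proportion to the input scale, because the biases are zero and ReLU is positively homogeneous. With normalization, the hidden feature vector has norm at most `sqrt(hidden_dim)`, so the output is bounded for fixed weights.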

Abstract

The divergence of the Q-value estimation has been a prominent issue in offline RL, where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Though this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly comprehend this mechanism and attain an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on Neural Tangent Kernel (NTK) to measure the evolving property of Q-network at training, which provides an intriguing explanation of the emergence of divergence.…
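
For readers unfamiliar with the Neural Tangent Kernel, the sketch below computes an empirical NTK Gram matrix for a tiny network using the standard definition, the inner product of parameter gradients at two inputs. It does not reproduce SEEM, whose exact formula is not given in the abstract above. The network `f(x) = v . tanh(W x)`, its sizes, and the helper names `param_gradient` and `empirical_ntk` are hypothetical choices for illustration.

```python
import numpy as np


def param_gradient(w, v, x):
    """Gradient of f(x) = v . tanh(W x) with respect to all parameters, flattened."""
    h = np.tanh(w @ x)
    grad_v = h                               # df/dv
    grad_w = np.outer(v * (1.0 - h**2), x)   # df/dW via the chain rule
    return np.concatenate([grad_w.ravel(), grad_v])


def empirical_ntk(w, v, inputs):
    """Gram matrix K[i, j] = <grad f(x_i), grad f(x_j)>."""
    jac = np.stack([param_gradient(w, v, x) for x in inputs])
    return jac @ jac.T


rng = np.random.default_rng(0)
w = rng.normal(size=(16, 4))      # hidden weights (hypothetical sizes)
v = rng.normal(size=16)           # output weights
inputs = rng.normal(size=(5, 4))  # five concatenated state-action vectors

kernel = empirical_ntk(w, v, inputs)
eigenvalues = np.linalg.eigvalsh(kernel)  # real, since the Gram matrix is symmetric
print(kernel.shape, eigenvalues)
```

A large off-diagonal entry `K[i, j]` means a gradient step that changes the prediction at `x_j` also moves the prediction at `x_i` strongly. This coupling between inputs is the kind of quantity an NTK-based analysis tracks during training.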

Code & Models

Videos

Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL · SlidesLive

Taxonomy

Topics: Stock Market Forecasting Methods · Neural Networks and Applications · Neural Networks and Reservoir Computing

Methods: Stochastic Gradient Descent