Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL
Yang Yue, Rui Lu, Bingyi Kang, Shiji Song, Gao Huang

TL;DR
This paper investigates the root cause of Q-value divergence in offline RL, introduces an NTK-based metric to predict divergence, and proposes using LayerNorm in the Q-network to improve stability and performance.
Contribution
It identifies self-excitation as the main cause of divergence, develops a predictive NTK-based measure, and demonstrates LayerNorm as an effective architectural solution.
Findings
The SEEM metric predicts divergence early in training.
LayerNorm effectively prevents divergence without introducing detrimental bias.
The method achieves state-of-the-art results on challenging offline RL tasks.
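To make the LayerNorm finding concrete, here is a minimal sketch of the layer normalization operation itself, written in plain Python. It is not the authors' implementation: the function names, the example activations, and the omission of the learned scale and shift parameters are all assumptions made for illustration. It shows that the normalized output has roughly the same magnitude whatever the scale of the input, which is one plausible intuition (not a claim taken from this summary) for why the operation can help keep value estimates from growing without bound.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance.

    Simplified: the learned scale and shift used in practice are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def l2_norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in x))

# Hypothetical hidden activations at two very different scales.
small = [0.1, -0.2, 0.3, 0.4]
large = [v * 1000.0 for v in small]

# After normalization both vectors have length close to sqrt(len(x)) = 2.
norm_small = l2_norm(layer_norm(small))
norm_large = l2_norm(layer_norm(large))
print(norm_small, norm_large)
```

In a real Q-network the operation would typically be applied to hidden layers through a deep learning library's own layer normalization module rather than hand-written code like this.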
Abstract
The divergence of the Q-value estimation has been a prominent issue in offline RL, where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Though this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly comprehend this mechanism and attain an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on Neural Tangent Kernel (NTK) to measure the evolving property of Q-network at training, which provides an intriguing explanation of the emergence of divergence.…
Taxonomy
Topics: Stock Market Forecasting Methods · Neural Networks and Applications · Neural Networks and Reservoir Computing
Methods: Stochastic Gradient Descent
