Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression
Xinming Gao, Shangzhe Li, Yujin Cai, Wenwu Yu

TL;DR
This paper introduces a stable, hyperparameter-efficient offline RL method using quantile regression to improve Extreme Q-Learning, achieving competitive results across benchmarks.
Contribution
It proposes a novel approach to estimate the temperature coefficient via quantile regression and introduces value regularization for stable training in offline RL.
Findings
Achieves competitive or superior performance on D4RL and NeoRL2 benchmarks.
Maintains stable training dynamics with a consistent hyperparameter set.
Reduces the need for extensive hyperparameter tuning in offline RL.
Abstract
Offline reinforcement learning (RL) enables policy learning from fixed datasets without further environment interaction, making it particularly valuable in high-risk or costly domains. Extreme Q-Learning (XQL) is a recent offline RL method that models Bellman errors using the Extreme Value Theorem, yielding strong empirical performance. However, XQL and its stabilized variant MXQL suffer from notable limitations: both require extensive hyperparameter tuning specific to each dataset and domain, and both exhibit instability during training. To address these issues, we propose a principled method to estimate the temperature coefficient via quantile regression under mild assumptions. To further improve training stability, we introduce a value regularization technique with mild generalization, inspired by recent advances in constrained value learning. Experimental results…
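For readers unfamiliar with quantile regression, the building block named in the abstract, the following is a minimal sketch of the standard pinball (quantile) loss and a scalar quantile fit by subgradient descent. It is not the paper's temperature estimator, which the abstract does not spell out; the function names `pinball_loss` and `fit_constant_quantile`, the learning rate, and the step count are all illustrative assumptions.

```python
import numpy as np


def pinball_loss(residuals, tau):
    """Mean pinball (quantile) loss, where residuals = target - prediction.

    Positive residuals are weighted by tau, negative ones by (1 - tau).
    """
    residuals = np.asarray(residuals, dtype=float)
    return float(np.mean(np.maximum(tau * residuals, (tau - 1.0) * residuals)))


def fit_constant_quantile(samples, tau, lr=0.05, steps=2000):
    """Fit a scalar q that minimizes the pinball loss over `samples`.

    Hypothetical helper for illustration: plain subgradient descent
    with a fixed step size, started at the sample mean.
    """
    samples = np.asarray(samples, dtype=float)
    q = float(np.mean(samples))
    for _ in range(steps):
        # Subgradient w.r.t. q: -tau where sample > q, (1 - tau) otherwise.
        grad = np.mean(np.where(samples > q, -tau, 1.0 - tau))
        q -= lr * grad
    return q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal(10_000)
    q_hat = fit_constant_quantile(data, tau=0.9)
    # The minimizer of the pinball loss should sit near the empirical quantile.
    print(q_hat, np.quantile(data, 0.9))
```

The minimizer of the pinball loss at level tau is the tau-quantile of the data, which is why a quantile-regression objective can recover a distributional statistic without assuming a particular noise model. How the paper maps such a quantile estimate to the XQL temperature coefficient is not stated in the abstract.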
