Distorted Distributional Policy Evaluation for Offline Reinforcement Learning
Ryo Iwaki, Takayuki Osogami

TL;DR
This paper introduces quantile distortion in offline Distributional Reinforcement Learning to enable non-uniform pessimism, improving value estimation and performance over traditional uniform approaches.
Contribution
It proposes a novel quantile distortion method that adjusts conservatism based on data support, backed by theoretical analysis and empirical validation.
Findings
Enhanced performance in offline RL tasks
Outperforms uniform pessimism methods
Theoretically justified approach
Abstract
While Distributional Reinforcement Learning (DRL) methods have demonstrated strong performance in online settings, their success in offline scenarios remains limited. We hypothesize that a key limitation of existing offline DRL methods is that they underestimate all return quantiles uniformly. This uniform pessimism can lead to overly conservative value estimates, ultimately hindering generalization and performance. To address this, we introduce a novel concept called quantile distortion, which enables non-uniform pessimism by adjusting the degree of conservatism based on the availability of supporting data. Our approach is grounded in theoretical analysis and empirically validated, demonstrating improved performance over uniform pessimism.
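To make the contrast between uniform and non-uniform pessimism concrete, here is a minimal Python sketch. It is not the paper's quantile distortion operator, which the abstract does not specify. The function names, the `penalty` parameter, and the `support` score in [0, 1] are all hypothetical, chosen only to illustrate the idea that conservatism can shrink where the dataset offers more support.

```python
import numpy as np


def uniform_pessimism(quantiles, penalty):
    """Shift every return quantile down by the same fixed amount."""
    return np.asarray(quantiles, dtype=float) - penalty


def support_weighted_pessimism(quantiles, penalty, support):
    """Shift quantiles down by an amount that shrinks as data support grows.

    `support` is a hypothetical score in [0, 1] (an assumption of this
    sketch): 1 means the state-action pair is well covered by the offline
    dataset, 0 means it is effectively unseen.
    """
    support = np.clip(support, 0.0, 1.0)
    return np.asarray(quantiles, dtype=float) - penalty * (1.0 - support)


# Illustrative return quantiles for a single state-action pair.
quantiles = np.array([-1.0, 0.0, 1.0, 2.0])

# Well-supported pair: no pessimism is applied in this sketch.
well_supported = support_weighted_pessimism(quantiles, penalty=0.5, support=1.0)

# Unsupported pair: the estimate falls back to the uniform shift.
unsupported = support_weighted_pessimism(quantiles, penalty=0.5, support=0.0)
```

With full support the quantiles are left unchanged, and with no support the result coincides with `uniform_pessimism`, so in this sketch the uniform approach appears as the worst-case special case.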
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
