Distorted Distributional Policy Evaluation for Offline Reinforcement Learning

Ryo Iwaki; Takayuki Osogami

arXiv:2601.01917·cs.LG·January 6, 2026

TL;DR

This paper introduces quantile distortion in offline Distributional Reinforcement Learning to enable non-uniform pessimism, improving value estimation and performance over traditional uniform approaches.

Contribution

It proposes a novel quantile distortion method that adjusts conservatism based on data support, backed by theoretical analysis and empirical validation.

Findings

01 Enhanced performance in offline RL tasks

02 Outperforms uniform pessimism methods

03 Theoretically justified approach

Abstract

While Distributional Reinforcement Learning (DRL) methods have demonstrated strong performance in online settings, their success in offline scenarios remains limited. We hypothesize that a key limitation of existing offline DRL methods lies in their uniform underestimation of return quantiles. This uniform pessimism can lead to overly conservative value estimates, ultimately hindering generalization and performance. To address this, we introduce a novel concept called quantile distortion, which enables non-uniform pessimism by adjusting the degree of conservatism based on the availability of supporting data. Our approach is grounded in theoretical analysis and empirically validated, demonstrating improved performance over uniform pessimism.
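
The contrast between uniform and support-dependent pessimism can be illustrated with a minimal sketch. This is not the paper's quantile distortion operator, which the abstract does not specify. It only shows the general idea of applying the same penalty everywhere versus shrinking the penalty where more data is available. The function names, the `visit_count` input, and the `1/sqrt(1 + n)` weight are all hypothetical choices made for illustration.

```python
import math


def uniform_pessimism(quantiles, penalty):
    """Shift every return quantile down by the same constant penalty."""
    return [q - penalty for q in quantiles]


def support_scaled_pessimism(quantiles, penalty, visit_count):
    """Shift quantiles down by a penalty that shrinks as data support grows.

    The 1/sqrt(1 + n) weight is a hypothetical choice, not taken from the paper.
    """
    weight = 1.0 / math.sqrt(1.0 + visit_count)
    return [q - penalty * weight for q in quantiles]


# Hypothetical quantile estimates of the return for one state-action pair.
quantiles = [0.0, 1.0, 2.0, 3.0]

uniform = uniform_pessimism(quantiles, penalty=1.0)
poorly_supported = support_scaled_pessimism(quantiles, penalty=1.0, visit_count=0)
well_supported = support_scaled_pessimism(quantiles, penalty=1.0, visit_count=99)
```

In this sketch, a pair with no supporting data receives the full penalty and matches the uniform case, while a pair seen 99 times is penalized by only a tenth as much, so its value estimate is less conservative.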

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques