TL;DR
This paper introduces CPQL, a conservative multi-step offline RL algorithm that leverages Peng's Q(λ) operator for better value estimation, outperforming existing methods and aiding offline-to-online learning.
Contribution
To the authors' knowledge, it is the first work to demonstrate, both theoretically and empirically, the effectiveness of conservative multi-step value estimation in offline RL.
Findings
CPQL outperforms existing offline single-step baselines on D4RL.
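In offline RL, the fixed point of the PQL operator lies closer to the behavior policy's value function, which implicitly regularizes the learned policy toward the behavior policy.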
Pre-trained Q-functions with CPQL improve online fine-tuning stability.
Abstract
We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q(λ) (CPQL). Our algorithm adapts Peng's Q(λ) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi-step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees, a milestone that previous…
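To make the multi-step operator concrete, the following is a minimal sketch of how Peng's Q(λ) targets can be computed along a single stored trajectory. It is not the authors' implementation and it omits the conservative penalty that CPQL adds on top. The function name `peng_q_lambda_targets`, its arguments, and the default values of `gamma` and `lam` are hypothetical. The bootstrap values in `next_values` are assumed to be supplied by the caller, for example as the maximum of a learned Q-function over actions at the next state.

```python
def peng_q_lambda_targets(rewards, next_values, gamma=0.99, lam=0.7):
    """Backward recursion for Peng's Q(lambda) targets on one trajectory.

    rewards[t]     : reward r_t observed at step t
    next_values[t] : bootstrap value at s_{t+1}, e.g. max_a Q(s_{t+1}, a)
                     (assumed given; use 0.0 when s_{t+1} is terminal)

    Recursion: G_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * G_{t+1})
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have the same length")

    targets = [0.0] * len(rewards)
    # Past the end of the stored trajectory there is no further return to
    # mix in, so the last step reduces to the one-step target.
    next_target = next_values[-1] if rewards else 0.0
    for t in reversed(range(len(rewards))):
        mixed = (1.0 - lam) * next_values[t] + lam * next_target
        targets[t] = rewards[t] + gamma * mixed
        next_target = targets[t]
    return targets


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    next_values = [2.0, 2.0, 0.0]  # last transition ends in a terminal state
    print(peng_q_lambda_targets(rewards, next_values, gamma=0.5, lam=0.0))
    print(peng_q_lambda_targets(rewards, next_values, gamma=0.5, lam=0.5))
```

Setting `lam=0.0` recovers the usual one-step Bellman target, while `lam=1.0` uses the discounted return of the stored trajectory with a single bootstrap at its end. Intermediate values interpolate between the two, which is how the operator draws on whole offline trajectories.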
