Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning

Byeongchan Kim; Min-hwan Oh

arXiv:2605.14779·cs.LG·May 15, 2026

TL;DR

This paper introduces CPQL, a conservative multi-step offline RL algorithm that uses Peng's Q(λ) operator in place of the one-step Bellman operator for value estimation. It outperforms existing single-step offline baselines on D4RL and stabilizes offline-to-online fine-tuning.

Contribution

To the authors' knowledge, it is the first work in offline RL to demonstrate, both theoretically and empirically, the effectiveness of conservative value estimation with a multi-step operator.

Findings

01. CPQL outperforms existing offline single-step baselines on D4RL.

02. The fixed point of PQL is closer to the behavior policy's value function.

03. Pre-trained Q-functions with CPQL improve online fine-tuning stability.
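
The second finding can be illustrated with the standard definition of Peng's Q($\lambda$) from the literature. The notation below is an assumption and may differ from the paper's: $\mu$ is the behavior policy, $\mathcal{T}$ the Bellman optimality operator, and $\mathcal{T}^{\mu}$ the policy-evaluation operator for $\mu$. This is a sketch of the usual argument, not the paper's theorem.

$$
\mathcal{T}_{\lambda}^{\mu} Q = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} (\mathcal{T}^{\mu})^{n-1} \mathcal{T} Q
\quad\Longrightarrow\quad
\mathcal{T}_{\lambda}^{\mu} Q = (1-\lambda)\, \mathcal{T} Q + \lambda\, \mathcal{T}^{\mu} \mathcal{T}_{\lambda}^{\mu} Q .
$$

Setting $\mathcal{T}_{\lambda}^{\mu} \hat{Q} = \hat{Q}$ gives the fixed-point equation

$$
\hat{Q} = (1-\lambda)\, \mathcal{T} \hat{Q} + \lambda\, \mathcal{T}^{\mu} \hat{Q} .
$$

At $\lambda = 0$ this is the Bellman optimality equation. As $\lambda \to 1$ it approaches $\hat{Q} = \mathcal{T}^{\mu} \hat{Q}$, whose solution is the behavior policy's value function $Q^{\mu}$. Larger $\lambda$ therefore pulls the fixed point toward $Q^{\mu}$.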

Abstract

We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q($\lambda$) (CPQL). Our algorithm adapts the Peng's Q($\lambda$) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi-step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees -- a milestone that previous…
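
As a rough illustration of what a multi-step PQL target looks like on a stored trajectory, the sketch below computes the standard Peng's Q($\lambda$) return with a backward recursion. It is not the CPQL implementation from the paper or the oh-lab/CPQL repository: the conservative regularizer is omitted, and the function name, arguments, and discrete-action `max` are assumptions made for illustration. Continuous-control tasks would need some approximation of the greedy value instead.

```python
def peng_q_lambda_targets(rewards, next_q_values, dones, gamma=0.99, lam=0.7):
    """Peng's Q(lambda) targets for one stored trajectory (illustrative sketch).

    rewards[t]       -- reward r_t
    next_q_values[t] -- Q(s_{t+1}, a) for each action a (hypothetical critic output)
    dones[t]         -- True if s_{t+1} is terminal
    Recursion: G_t = r_t + gamma * ((1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1})
    """
    num_steps = len(rewards)
    targets = [0.0] * num_steps
    next_target = None  # G_{t+1}; None means no later step is stored
    for t in reversed(range(num_steps)):
        if dones[t]:
            # Terminal transition: nothing to bootstrap from.
            targets[t] = rewards[t]
        else:
            greedy = max(next_q_values[t])
            # If the trajectory was truncated here, fall back to a one-step bootstrap.
            tail = greedy if next_target is None else next_target
            targets[t] = rewards[t] + gamma * ((1.0 - lam) * greedy + lam * tail)
        next_target = targets[t]
    return targets


# Two-step episode that terminates after the second transition.
example_targets = peng_q_lambda_targets(
    rewards=[1.0, 1.0],
    next_q_values=[[0.0, 2.0], [1.0, 3.0]],
    dones=[False, True],
    gamma=0.5,
    lam=0.5,
)
# example_targets == [1.75, 1.0]
```

With `lam=0.0` the first target becomes the one-step Q-learning target (2.0 in this example), and with `lam=1.0` it becomes the discounted return actually observed in the data (1.5). Intermediate values of `lam` blend the two, which is how the multi-step target stays closer to behavior that appears in the offline trajectories.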

Code & Models

Repositories

oh-lab/CPQL (GitHub)

Videos

Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning (SlidesLive)