Doubly Optimal Policy Evaluation for Reinforcement Learning

Shuze Daniel Liu; Claire Chen; Shangtong Zhang

arXiv:2410.02226·cs.LG·March 21, 2025

TL;DR

This paper introduces a doubly optimal policy evaluation method for reinforcement learning that reduces variance by pairing an optimal data-collecting policy with an optimal data-processing baseline. The improvement is supported both theoretically and empirically.

Contribution

It proposes a doubly optimal policy evaluation approach that is unbiased and is guaranteed to have lower variance than previously best-performing methods.

Findings

01 Reduces variance significantly over previous methods

02 Achieves unbiased and more accurate policy evaluation

03 Demonstrates superior empirical performance in experiments

Abstract

Policy evaluation estimates the performance of a policy by (1) collecting data from the environment and (2) processing raw data into a meaningful estimate. Due to the sequential nature of reinforcement learning, any improper data-collecting policy or data-processing method substantially deteriorates the variance of evaluation results over long time steps. Thus, policy evaluation often suffers from large variance and requires massive data to achieve the desired accuracy. In this work, we design an optimal combination of data-collecting policy and data-processing baseline. Theoretically, we prove our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than previously best-performing methods. Empirically, compared with previous works, we show our method reduces variance substantially and achieves superior empirical performance.
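The two ingredients named in the abstract, a data-collecting policy and a data-processing baseline, can be illustrated on a one-step toy problem with deterministic rewards. The sketch below is only an illustration under those assumptions and is not the paper's multi-step construction; the function `is_estimator_moments` and the numbers in `pi` and `r` are hypothetical. It computes the exact mean and variance of an importance-sampling estimator, so no random sampling is involved.

```python
import numpy as np


def is_estimator_moments(pi, mu, r, b=None):
    """Exact mean and variance of the one-step estimator
    rho(a) * (r(a) - b(a)) + sum_a pi(a) * b(a), with a ~ mu and rho = pi / mu.

    Hypothetical helper for illustration; rewards are assumed deterministic.
    """
    pi, mu, r = (np.asarray(x, dtype=float) for x in (pi, mu, r))
    b = np.zeros_like(r) if b is None else np.asarray(b, dtype=float)
    # Value the estimator takes when each action is the one sampled.
    values = pi / mu * (r - b) + np.dot(pi, b)
    mean = float(np.dot(mu, values))
    var = float(np.dot(mu, (values - mean) ** 2))
    return mean, var


# Toy target policy and deterministic rewards (assumed values).
pi = np.array([0.7, 0.2, 0.1])
r = np.array([1.0, 4.0, 10.0])
true_value = float(np.dot(pi, r))

# (a) On-policy data collection, no baseline.
on_policy = is_estimator_moments(pi, pi, r)

# (b) Tailored data collection: mu proportional to pi(a) * |r(a)|, no baseline.
mu_tailored = pi * np.abs(r) / np.dot(pi, np.abs(r))
tailored = is_estimator_moments(pi, mu_tailored, r)

# (c) On-policy data collection with a perfect baseline b = r.
baseline = is_estimator_moments(pi, pi, r, b=r)

for name, (mean, var) in [("on-policy", on_policy),
                          ("tailored mu", tailored),
                          ("baseline", baseline)]:
    print(f"{name:12s} mean={mean:.3f} var={var:.3f} (true value {true_value:.3f})")
```

All three estimators have the same mean, which equals the true value, while either the tailored behavior policy or the baseline removes the variance in this toy case. In sequential problems with stochastic transitions neither knob alone can do this, which is the setting the abstract addresses.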

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

1. Which behavior policy is best for collecting samples to evaluate a given policy is an interesting problem to study.

Weaknesses

The pipeline of the proposed method is as follows: first, compute an optimal behavior policy and baseline function by solving an optimization problem; second, use the derived behavior policy to collect samples and apply an importance-weighted estimator to these samples. 1. However, in the first stage, computing the optimal behavior policy and baseline function requires many samples, and I think this stage needs far more samples than the second stage (since the definitio

Reviewer 02·Rating 5·Confidence 3

Strengths

+ The paper is quite well-written and easy to follow. The motivation and the key idea of the method are clearly presented. The differences between the proposed method and prior work have been clearly discussed in the related work section.
+ The authors provide a detailed and comprehensive theoretical proof to support the proposed method, though there seem to be some errors in the proof of Theorem 1.
+ The experiments show performance improvement over existing methods.

Weaknesses

- A potential error in the proof: in equation (23), the second equality does not make sense. By equation (12), $\mu^{*}_t(a|s) \propto \pi_t(a|s) \sqrt{u_{\pi,t}(s,a)}$, so we should have $\sum_{a} \frac{\pi_t(a|S_t)^2}{\mu^{*}_t(a|S_t)} u_{\pi,t}(S_t,a) = \sum_{a} \pi_t(a|S_t)\sqrt{u_{\pi,t}(S_t,a)}$. However, the paper seems to give the wrong result, $\sum_{a} \frac{\pi_t(a|S_t)^2}{\mu^{*}_t(a|S_t)} u_{\pi,t}(S_t,a) =\sum_{a}\pi_t
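For reference, the substitution discussed in this comment can be worked out with the normalizing constant written explicitly. This is a sketch under two assumptions that are not stated in the excerpt: the symbol $Z_t(s)$ is introduced here for the normalizer, and $u_{\pi,t}(s,a) > 0$ wherever $\pi_t(a|s) > 0$. Whether the result matches the paper's equation (23) cannot be confirmed from the truncated text.

$$Z_t(s) = \sum_{b} \pi_t(b|s)\sqrt{u_{\pi,t}(s,b)}, \qquad \mu^{*}_t(a|s) = \frac{\pi_t(a|s)\sqrt{u_{\pi,t}(s,a)}}{Z_t(s)}$$

$$\sum_{a} \frac{\pi_t(a|S_t)^2}{\mu^{*}_t(a|S_t)}\, u_{\pi,t}(S_t,a) = Z_t(S_t) \sum_{a} \pi_t(a|S_t)\sqrt{u_{\pi,t}(S_t,a)} = \Big(\sum_{a} \pi_t(a|S_t)\sqrt{u_{\pi,t}(S_t,a)}\Big)^{2}$$

Under these assumptions the expression without the square coincides with this result only when $Z_t(S_t) = 1$.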

Reviewer 03·Rating 8·Confidence 4

Strengths

The related work section is comprehensive, contextualizing the paper within current literature and algorithms. The paper is generally clear and accessible, a crucial feature given the depth of theoretical content. The theoretical analysis rigorously establishes DOPT's optimality over on-policy, DR, and ODI baselines.

Weaknesses

While optimal theoretically, in practice DOPT requires off-policy estimation of (1) the Q-function $q(s, a)$, (2) the next-state value variance $\nu(s, a)$, and (3) the amplification factor $u(s, a)$ (how much the probability of an action should be amplified for a given $(s, a)$ pair). Specifically, $u$ is learned based on a Bellman operator with $\nu$ as the reward and with an importance sampling correction term (Lemma 3). Neither $\nu$ nor $u$ is required by the previous schemes, DR and ODI, and both can introduce errors
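To make the first two quantities in this comment concrete, the sketch below computes a Q-function and a next-state value variance (named `nu` here) by backward induction in a small tabular, finite-horizon MDP with a known model. It is an illustration under stated assumptions, not the paper's estimator: the function name, the time-homogeneous arrays `P`, `R`, and `pi`, and the toy MDP are hypothetical, and the recursion for $u$ from Lemma 3 is not reproduced because the excerpt does not give it.

```python
import numpy as np


def q_and_next_value_variance(P, R, pi, horizon):
    """Backward induction in a tabular finite-horizon MDP (known model assumed).

    P[s, a, s2]: transition probabilities, R[s, a]: expected reward,
    pi[s, a]: target policy. Returns q[t, s, a] and nu[t, s, a], where
    nu is the variance of the next-state value given (s, a).
    """
    n_states, n_actions, _ = P.shape
    q = np.zeros((horizon, n_states, n_actions))
    nu = np.zeros((horizon, n_states, n_actions))
    v_next = np.zeros(n_states)  # value after the final step is zero
    for t in reversed(range(horizon)):
        expected_next = P @ v_next                      # E[v(S') | s, a]
        nu[t] = P @ (v_next ** 2) - expected_next ** 2  # Var[v(S') | s, a]
        q[t] = R + expected_next
        v_next = np.sum(pi * q[t], axis=1)              # v_t(s) under pi
    return q, nu


# Toy two-state MDP (assumed values): state 1 is absorbing and pays reward 1.
P = np.zeros((2, 2, 2))
P[0, 0] = [0.5, 0.5]   # action 0 in state 0: coin flip between states
P[0, 1] = [0.0, 1.0]   # action 1 in state 0: move to state 1
P[1, :, 1] = 1.0       # state 1 stays in state 1
R = np.array([[0.0, 0.0], [1.0, 1.0]])
pi = np.full((2, 2), 0.5)

q, nu = q_and_next_value_variance(P, R, pi, horizon=2)
print(q[0])
print(nu[0])
```

In this toy case only the coin-flip action has nonzero next-state value variance at the first step. When the model is unknown, as in the setting the reviewer describes, these quantities would have to be estimated from data, which is where the additional estimation error can enter.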

Videos

Doubly Optimal Policy Evaluation for Reinforcement Learning·slideslive

Taxonomy

Topics: Elevator Systems and Control · Adaptive Dynamic Programming Control