Doubly Optimal Policy Evaluation for Reinforcement Learning
Shuze Daniel Liu, Claire Chen, Shangtong Zhang

TL;DR
This paper introduces a doubly optimal policy evaluation method for reinforcement learning that pairs an optimal data-collecting policy with an optimal data-processing baseline. The method is proven to be unbiased with lower variance than previously best-performing methods, and experiments show a substantial variance reduction.
Contribution
It proposes a doubly optimal policy evaluation approach that is unbiased and provably has lower variance than previously best-performing methods.
Findings
Reduces variance significantly over previous methods
Achieves unbiased and more accurate policy evaluation
Demonstrates superior empirical performance in experiments
Abstract
Policy evaluation estimates the performance of a policy by (1) collecting data from the environment and (2) processing raw data into a meaningful estimate. Due to the sequential nature of reinforcement learning, any improper data-collecting policy or data-processing method substantially deteriorates the variance of evaluation results over long time steps. Thus, policy evaluation often suffers from large variance and requires massive data to achieve the desired accuracy. In this work, we design an optimal combination of data-collecting policy and data-processing baseline. Theoretically, we prove our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than previously best-performing methods. Empirically, compared with previous works, we show our method reduces variance substantially and achieves superior empirical performance.
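The two ingredients named in the abstract, a data-collecting policy and a data-processing baseline, can be illustrated in a one-step setting. The sketch below is not the authors' method: it assumes a three-action bandit with deterministic rewards, and the names `pi`, `mu`, `r`, `b`, and `estimator_moments` are hypothetical. It computes the exact mean and variance of an importance-sampling estimator with a baseline, showing that the mean stays at the true value for any full-support behavior policy and any baseline, while the variance depends on both choices.

```python
def estimator_moments(pi, mu, r, b):
    """Exact mean and variance of rho(a) * (r(a) - b(a)) + sum_a' pi(a') * b(a'),
    where a ~ mu and rho(a) = pi(a) / mu(a). Assumes mu(a) > 0 for every action."""
    offset = sum(p * bi for p, bi in zip(pi, b))
    values = [p / m * (ri - bi) + offset for p, m, ri, bi in zip(pi, mu, r, b)]
    mean = sum(m * v for m, v in zip(mu, values))
    var = sum(m * (v - mean) ** 2 for m, v in zip(mu, values))
    return mean, var

pi = [0.7, 0.2, 0.1]      # target policy (hypothetical)
r = [1.0, 5.0, -2.0]      # deterministic rewards (assumption for illustration)
uniform = [1 / 3] * 3     # one possible behavior policy
zero = [0.0] * 3          # no baseline

true_value = sum(p * ri for p, ri in zip(pi, r))

on_policy = estimator_moments(pi, pi, r, zero)
off_policy = estimator_moments(pi, uniform, r, zero)
with_baseline = estimator_moments(pi, uniform, r, r)  # baseline equal to the reward

for name, (mean, var) in [("on-policy", on_policy),
                          ("off-policy", off_policy),
                          ("off-policy + baseline", with_baseline)]:
    print(f"{name}: mean={mean:.4f}, variance={var:.4f}")
```

All three configurations share the same mean; only the variance changes. In this toy case a baseline equal to the reward removes the variance entirely, which is possible only because the rewards are assumed deterministic.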
Peer Reviews
Decision·ICLR 2025 Poster
1. An interesting problem to study: when evaluating a given policy, what is the best behavior policy for collecting samples?
The pipeline of the proposed method is as follows. First, compute an optimal behavior policy and baseline function by solving an optimization problem. Second, use the derived behavior policy to collect samples and apply an importance-weighted estimator to these samples. 1. However, in the first stage, computing the optimal behavior policy and baseline function requires many samples, and I think this stage needs far more samples than the second stage (since the definitio
+ The paper is quite well-written and easy to follow. The motivation and the key idea of the method are clearly presented. The differences between the proposed method and prior work have been clearly discussed in the related work section.
+ The authors provide a detailed and comprehensive theoretical proof to support the proposed method, though there seem to be some errors in the proof of Theorem 1.
+ The experiments show performance improvement over existing methods.
- A potential error in the proof: In equation (23), the second equals sign does not make sense. By equation (12), $\mu^{*}_t(a|s) \propto \pi_t(a|s) \sqrt{u_{\pi,t}(s,a)}$. We should have $\sum_{a} \frac{\pi_t(a|S_t)^2}{\mu^{*}_t(a|S_t)} u_{\pi,t}(S_t,a) = \sum_{a} \pi_t(a|S_t)\sqrt{u_{\pi,t}(S_t,a)}$. However, the paper seems to give the wrong result, $\sum_{a} \frac{\pi_t(a|S_t)^2}{\mu^{*}_t(a|S_t)} u_{\pi,t}(S_t,a)$ $=\sum_{a}\pi_t
The related work section is comprehensive, contextualizing the paper within current literature and algorithms. The paper is generally clear and accessible, a crucial feature given the depth of theoretical content. The theoretical analysis rigorously establishes DOPT's optimality over on-policy, DR, and ODI baselines.
While optimal theoretically, in practice DOPT requires off-policy estimation of (1) the Q-function $q(s,a)$, (2) the next-state value variance $\nu(s,a)$, and (3) the amplification factor $u(s,a)$ (how much the probability of an action should be amplified for a given $(s,a)$ pair). Specifically, $u$ is learned based on a Bellman operator with $\nu$ as a reward and with an importance sampling correction term (Lemma 3). Neither $\nu$ nor $u$ is required by the previous schemes, DR and ODI, and both can introduce errors
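The proportionality $\mu^{*}_t(a|s) \propto \pi_t(a|s)\sqrt{u_{\pi,t}(s,a)}$ quoted in the reviews can be illustrated for a single state. The sketch below is not the authors' implementation. The values in `u` are hypothetical and assumed strictly positive, the function names are invented for illustration, and $\mu^{*}$ is assumed to be normalized so that it sums to one. Under that normalization, the weighted sum $\sum_a \pi(a)^2 u(a) / \mu^{*}(a)$ equals $\left(\sum_a \pi(a)\sqrt{u(a)}\right)^2$, which by the Cauchy-Schwarz inequality is no larger than the on-policy value $\sum_a \pi(a) u(a)$.

```python
import math

def optimal_behavior(pi, u):
    """Normalize pi(a) * sqrt(u(a)) into a probability distribution.
    Assumes u(a) > 0 wherever pi(a) > 0."""
    weights = [p * math.sqrt(ui) for p, ui in zip(pi, u)]
    z = sum(weights)
    return [w / z for w in weights]

def weighted_sum(pi, mu, u):
    """Compute sum_a pi(a)^2 * u(a) / mu(a) over actions with pi(a) > 0."""
    return sum(p * p / m * ui for p, m, ui in zip(pi, mu, u) if p > 0)

pi = [0.7, 0.2, 0.1]      # target policy at one state (hypothetical)
u = [0.25, 4.0, 9.0]      # hypothetical values of u(s, a)

mu_star = optimal_behavior(pi, u)
term_optimal = weighted_sum(pi, mu_star, u)   # behavior policy mu*
term_on_policy = weighted_sum(pi, pi, u)      # behavior policy equal to pi
normalizer = sum(p * math.sqrt(ui) for p, ui in zip(pi, u))

print("mu* =", [round(m, 4) for m in mu_star])
print("sum under mu*:", term_optimal)
print("squared normalizer:", normalizer ** 2)
print("sum under pi:", term_on_policy)
```

In practice $u$ would have to be estimated from data, so any estimation error carries over into `mu_star`; the sketch treats `u` as known only to keep the arithmetic visible.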
Taxonomy
Topics: Elevator Systems and Control · Adaptive Dynamic Programming Control
