Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Alexander W. Goodall; Edwin Hamel-De le Court; Francesco Belardinelli

arXiv:2511.10843 · cs.LG · January 6, 2026

TL;DR

This paper introduces a method that lowers the variance of return estimates in off-policy reinforcement learning by collecting data with purposely designed behaviour policies, which improves the sample efficiency and stability of policy-gradient algorithms.

Contribution

The paper extends recent off-policy evaluation results to online RL, showing that well-designed behaviour policies can reduce the variance of return estimates and make learning more efficient.

Findings

01. Lower variance return estimates improve policy-gradient performance.

02. Enhanced sample efficiency observed across multiple environments.

03. The approach outperforms traditional methods in stability and learning speed.

Abstract

Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we leverage new results from off-policy evaluation; it has recently been shown that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. This result is surprising as it means collecting data on-policy is not variance optimal. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved to learn optimal policies. Off-policy RL has been well studied (e.g., IMPALA), with correct and truncated importance weighted samples for de-biasing and managing variance appropriately. Generally these approaches are concerned with reconciling…
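
The claim that on-policy data collection is not variance optimal can be illustrated with a toy calculation. The sketch below is not the paper's algorithm: it assumes a hypothetical one-step, three-action problem with deterministic, positive rewards, and the names `target`, `rewards`, and `is_estimator_variance` are made up for this example. It computes the exact variance of a single-sample importance-sampling return estimate under two behaviour policies: the target policy itself (on-policy) and a behaviour policy proportional to `target[a] * |rewards[a]|`, which is the classic variance-minimising choice for importance sampling.

```python
def expected_return(target, rewards):
    """Expected one-step return under the target policy."""
    return sum(p * r for p, r in zip(target, rewards))


def is_estimator_variance(target, behaviour, rewards):
    """Exact variance of the single-sample importance-sampling estimate
    (target[a] / behaviour[a]) * rewards[a], with action a drawn from `behaviour`.

    Assumes behaviour[a] > 0 wherever target[a] * rewards[a] != 0,
    so that the estimate is unbiased.
    """
    mean = expected_return(target, rewards)
    second_moment = sum(
        b * ((p / b) * r) ** 2
        for p, b, r in zip(target, behaviour, rewards)
        if b > 0
    )
    return second_moment - mean ** 2


# Hypothetical toy problem (assumption, not taken from the paper).
target = [0.5, 0.3, 0.2]    # target policy over three actions
rewards = [1.0, 2.0, 10.0]  # deterministic reward per action

# On-policy: the behaviour policy equals the target policy.
var_on_policy = is_estimator_variance(target, target, rewards)

# Off-policy: behaviour proportional to target[a] * |rewards[a]|.
unnormalised = [p * abs(r) for p, r in zip(target, rewards)]
total = sum(unnormalised)
behaviour = [u / total for u in unnormalised]
var_off_policy = is_estimator_variance(target, behaviour, rewards)

print("on-policy variance: ", var_on_policy)
print("off-policy variance:", var_off_policy)
```

In this toy setting every importance-weighted sample equals the expected return, so the off-policy variance collapses to zero up to floating-point error, while the on-policy variance stays strictly positive. Multi-step problems with stochastic rewards and unknown returns are harder, which is where a learned behaviour policy would come in.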

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Causal Inference Techniques · Domain Adaptation and Few-Shot Learning