Logging Policy Design for Off-Policy Evaluation
Connor Douglas, Joel Persson, Foster Provost

TL;DR
This paper investigates how to design logging policies that minimize off-policy evaluation error by navigating a tradeoff between concentrating on high-reward actions and covering the actions a target policy may take, with theoretical and practical guidance for recommendation systems.
Contribution
It introduces a unifying framework for logging policy design, deriving optimal policies under various informational regimes and providing practical principles for real-world implementation.
Findings
Characterized the reward-coverage tradeoff in logging policy design.
Derived optimal logging policies for known, unknown, and partially known reward distributions.
Provided practical guidelines for policy design under operational constraints.
Abstract
Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our…
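To make the setup in the abstract concrete, the following is a minimal sketch of inverse propensity scoring (IPS), one standard OPE estimator. The truncated abstract does not say which estimator the paper analyzes, so IPS is an assumption here, and the function names, the two-action setup, and all probabilities are hypothetical choices for illustration only.

```python
import random


def ips_estimate(logs, target_policy):
    """Inverse propensity scoring estimate of a target policy's value.

    logs: list of (action, reward, logging_prob) tuples collected by the
          logging policy (logging_prob must be > 0 for every logged action).
    target_policy: dict mapping action -> probability under the target policy.
    """
    total = 0.0
    for action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to take the action than the logging policy was.
        total += target_policy.get(action, 0.0) / logging_prob * reward
    return total / len(logs)


def simulate_logs(logging_policy, mean_reward, n, rng):
    """Draw n logged interactions with Bernoulli rewards (toy assumption)."""
    actions = list(logging_policy)
    weights = [logging_policy[a] for a in actions]
    logs = []
    for _ in range(n):
        a = rng.choices(actions, weights=weights)[0]
        r = 1.0 if rng.random() < mean_reward[a] else 0.0
        logs.append((a, r, logging_policy[a]))
    return logs


# Hypothetical two-action problem: "a" has high reward, but the target
# policy mostly takes "b".
mean_reward = {"a": 0.8, "b": 0.2}
target = {"a": 0.1, "b": 0.9}
true_value = sum(target[x] * mean_reward[x] for x in target)

rng = random.Random(0)
# A logging policy that covers both actions evenly.
uniform_logs = simulate_logs({"a": 0.5, "b": 0.5}, mean_reward, 20000, rng)
# A logging policy concentrated on the high-reward action.
skewed_logs = simulate_logs({"a": 0.99, "b": 0.01}, mean_reward, 20000, rng)

print("true value:      ", true_value)
print("uniform logging: ", ips_estimate(uniform_logs, target))
print("skewed logging:  ", ips_estimate(skewed_logs, target))
```

Under the skewed logging policy, action "b" is logged only about 1% of the time, so each of its samples carries an importance weight of 0.9 / 0.01 = 90. The estimate then rests on a small number of heavily weighted observations, which illustrates the coverage side of the tradeoff described in the abstract.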
