Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

Pai Liu; Lingfeng Zhao; Shivangi Agarwal; Jinghan Liu; Audrey Huang; Philip Amortila; Nan Jiang

arXiv:2502.08021·cs.LG·October 27, 2025

TL;DR

This paper introduces new selection algorithms and an experimental protocol for tuning the hyperparameters of off-policy evaluation (OPE) methods in offline reinforcement learning, addressing variance issues and improving evaluation stability.

Contribution

The paper develops novel model-free and model-based selectors with theoretical guarantees, and proposes a new experimental protocol for more stable and comprehensive evaluation.

Findings

01. LSTD-Tournament outperforms existing methods in experiments

02. New protocol enables better control and evaluation of candidate models

03. Proposed selectors demonstrate promising empirical results

Abstract

Holdout validation and hyperparameter tuning from data is a long-standing problem in offline reinforcement learning (RL). A standard framework is to use off-policy evaluation (OPE) methods to evaluate and select the policies, but OPE either incurs exponential variance (e.g., importance sampling) or has hyperparameters of its own (e.g., FQE and model-based methods). We focus on hyperparameter tuning for OPE itself, which is even more under-investigated. Concretely, we select among candidate value functions ("model-free") or dynamics models ("model-based") to best assess the performance of a target policy. We develop: (1) new model-free and model-based selectors with theoretical guarantees, and (2) a new experimental protocol for empirically…
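
To give a rough sense of the exponential-variance issue that the abstract attributes to importance sampling, the following minimal sketch computes the exact variance of a trajectory-level importance weight in a deliberately simplified setting. The function name `is_weight_variance`, the two-action policies, and the assumption that both policies are state-independent (so the per-step ratios are i.i.d.) are illustrative choices and are not taken from the paper.

```python
def is_weight_variance(pi_target, pi_behavior, horizon):
    """Exact variance of the cumulative importance weight over `horizon` steps.

    Assumes (for illustration only) state-independent policies, so the
    per-step ratios pi_target(a) / pi_behavior(a) are i.i.d. under the
    behavior policy and the cumulative weight is their product.
    Every entry of pi_behavior must be strictly positive.
    """
    # Per-step second moment under the behavior policy:
    # E_b[(p/q)^2] = sum_a q(a) * (p(a)/q(a))^2 = sum_a p(a)^2 / q(a).
    second_moment = sum(p * p / q for p, q in zip(pi_target, pi_behavior))
    # The weight has mean 1, so Var = E[W^2] - 1 = second_moment**horizon - 1.
    return second_moment ** horizon - 1.0


if __name__ == "__main__":
    target = [0.9, 0.1]    # hypothetical target policy
    behavior = [0.5, 0.5]  # hypothetical behavior policy
    for h in (1, 5, 10, 20):
        print(h, is_weight_variance(target, behavior, h))
```

In this toy setting the variance grows geometrically with the horizon whenever the two policies differ, and it is zero when they coincide. This is the kind of behavior that motivates estimators other than pure importance sampling.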

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Causal Inference Techniques · Advanced Bandit Algorithms Research
