A Unifying View of Coverage in Linear Off-Policy Evaluation

Philip Amortila; Audrey Huang; Akshay Krishnamurthy; Nan Jiang

arXiv:2601.19030·cs.LG·January 28, 2026

TL;DR

This paper introduces a unified framework for understanding coverage in linear off-policy evaluation in reinforcement learning, providing a new coverage parameter that generalizes previous notions and offers tighter finite-sample guarantees.

Contribution

It proposes a novel coverage parameter called feature-dynamics coverage, unifying various existing definitions and enabling a comprehensive analysis of linear OPE algorithms.

Findings

1. Introduces feature-dynamics coverage parameter.
2. Provides finite-sample error bounds based on this new coverage.
3. Recovers classical coverage notions under additional assumptions.

Abstract

Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of linear OPE, finite-sample guarantees often take the form $\text{Evaluation error} \leq \mathrm{poly}(C^{\pi}, d, 1/n, \log(1/\delta)),$ where $d$ is the dimension of the features and $C^{\pi}$ is a coverage parameter that characterizes the degree to which the visited features lie in the span of the data distribution. While such guarantees are well-understood for several popular algorithms under stronger assumptions (e.g. Bellman completeness), the understanding is lacking and fragmented in the minimal setting where only the target value function is linearly realizable in the features. Despite recent interest in tight characterizations of the statistical rate in this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Uses $Z=\phi(s,a)$ as an instrumental variable to address the "errors-in-variables" problem induced by $X=\phi(s,a)-\gamma\,\phi(s',a')$, yielding a finite-sample value bound.
- The proposed feature-dynamics coverage resolves key deficiencies of prior metrics by ensuring scale-invariance and a meaningful characterization under general off-policy distributions.
- The new definition of coverage via Proposition 1 is elegant, interpretable, and enables unification of various existing notio
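
The instrumental-variable reading in the first point can be illustrated with a small sketch. It is not the authors' code: the function names `solve` and `lstdq`, the matrix name `A`, and the two-state toy problem are assumptions made for illustration. The sketch forms the empirical moments $\hat{A} = \frac{1}{n}\sum_i Z_i X_i^\top$ and $\hat{b} = \frac{1}{n}\sum_i Z_i r_i$ with $Z=\phi(s,a)$ and $X=\phi(s,a)-\gamma\,\phi(s',a')$, then solves $\hat{A}\theta = \hat{b}$, which is the standard LSTDQ estimate.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is singular; the IV estimate is undefined")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstdq(transitions, gamma):
    """LSTDQ written as an instrumental-variable estimator.

    transitions: list of (phi, reward, phi_next), where phi = phi(s, a) and
    phi_next = phi(s', a') with a' drawn from the target policy.
    """
    n = len(transitions)
    d = len(transitions[0][0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for phi, reward, phi_next in transitions:
        # Regressor X = phi(s, a) - gamma * phi(s', a'); instrument Z = phi(s, a).
        x = [p - gamma * q for p, q in zip(phi, phi_next)]
        for i in range(d):
            b[i] += phi[i] * reward / n
            for j in range(d):
                A[i][j] += phi[i] * x[j] / n
    return solve(A, b)


# Hypothetical toy problem: two state-action pairs with one-hot features,
# deterministic transitions 0 -> 1 (reward 1) and 1 -> 0 (reward 0).
data = [
    ([1.0, 0.0], 1.0, [0.0, 1.0]),
    ([0.0, 1.0], 0.0, [1.0, 0.0]),
]
theta = lstdq(data, gamma=0.5)
print(theta)
```

With one-hot features and deterministic transitions the estimate coincides with the exact values of this toy problem, $Q(0) = 4/3$ and $Q(1) = 2/3$. The explicit singularity check reflects that the estimate is only defined when $\hat{A}$ is invertible.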

Weaknesses

1. The motivation for key constructions appears late, making the early sections harder to follow.
2. The paper could better distinguish the roles of Theorem 1 and Proposition 1 to clarify the main message.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The proofs are checked to be mathematically sound.
2. The perspective of analysis looks new to me.
3. Section 5 is appreciated since it delivers very clear messages on how to make sense of the newly defined parameter, as well as providing a good collection of equivalence results with existing parameters.

Weaknesses

1. The so-called "IV perspective" that inspires the new results confuses me a bit.
   * As far as I'm concerned, in a linear model $Y = X^{\top} \theta + \epsilon$, IV is only necessary when $X$ and $\epsilon$ are not independent. Speaking of intuitions, I don't see why it should be the case here.
   * It is also a little confusing to refer to Eq. (7) as the linear regression problem, since linear regression shouldn't come with the $\mathbb{E}$, but rather, with observable individual data po

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper introduces the feature-dynamics coverage parameter $C_\phi^\pi$, providing a unified perspective on coverage in linear off-policy evaluation (OPE). Derived from an IV view of the LSTDQ algorithm, $C_\phi^\pi$ quantifies how well features induced by the behavior policy capture the subspace relevant to the target policy. It interprets coverage as occurring within a feature-compressed MDP, linking the environment’s dynamics with the feature representation and offering a scale-invariant,

Weaknesses

1. The paper focuses on the linear function approximation setting, assuming $Q_\pi(s,a) = \phi(s,a)^\top \theta^\star$. This assumption enables a clean finite-sample analysis of the LSTDQ estimator and the introduction of the coverage parameter in Equation (13). However, the framework relies on the invertibility of $\Sigma$ and $A$ and applies only to the linear regime. Recent work in off-policy evaluation has advanced toward general function approximation via eluder dimension, where representat

Taxonomy

Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Advanced Bandit Algorithms Research