Off-policy Evaluation with Deeply-abstracted States
Meiling Hao, Pingfan Su, Liyuan Hu, Zoltan Szabo, Qingyuan Zhao, and Chengchun Shi

TL;DR
This paper introduces a novel approach to off-policy evaluation by creating deeply-abstracted states that reduce complexity and improve accuracy in large state spaces, supported by theoretical guarantees.
Contribution
It defines irrelevance conditions for state abstractions in OPE, proposes an iterative projection method for deep abstraction, and proves Fisher consistency of OPE estimators on these abstractions.
Findings
Deeply-abstracted states simplify OPE in large state spaces.
The method reduces sample complexity significantly.
Fisher consistency is established for OPE estimators on abstracted states.
Abstract
Off-policy evaluation (OPE) is crucial for assessing a target policy's impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions -- originally designed for policy learning -- in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE, and derive a backward-model-irrelevance condition for achieving irrelevance in (marginalized) importance sampling ratios by constructing a time-reversed Markov decision process (MDP). (ii) We propose a novel iterative procedure that sequentially projects the original state space into a smaller space, resulting in a deeply-abstracted state, which substantially simplifies the sample complexity of OPE arising from high cardinality. (iii) We prove the Fisher…
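To make the role of an abstraction in importance-sampling OPE concrete, here is a minimal sketch in Python. It is not the paper's estimator or its abstraction-learning procedure: it is plain trajectory-wise (sequential) importance sampling in which the ratios are computed on abstract states. The function name `is_estimate`, the parity abstraction `phi`, the two policies, and the toy trajectories are all hypothetical and chosen only for illustration. The sketch assumes both policies depend on the raw state only through `phi`.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[int, int, float]          # (raw_state, action, reward)
Policy = Dict[int, List[float]]        # abstract_state -> action probabilities


def is_estimate(
    trajectories: List[List[Step]],
    phi: Callable[[int], int],
    target: Policy,
    behavior: Policy,
    gamma: float = 1.0,
) -> float:
    """Sequential importance sampling with ratios evaluated on phi(state)."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0   # cumulative product of per-step policy ratios
        ret = 0.0      # discounted return of this trajectory
        for t, (state, action, reward) in enumerate(traj):
            z = phi(state)  # map the raw state to its abstract state
            weight *= target[z][action] / behavior[z][action]
            ret += (gamma ** t) * reward
        total += weight * ret
    return total / len(trajectories)


# Hypothetical abstraction: group raw states by parity.
phi = lambda s: s % 2

# Hypothetical policies defined on the two abstract states.
target = {0: [1.0, 0.0], 1: [0.5, 0.5]}
behavior = {0: [0.5, 0.5], 1: [0.5, 0.5]}

# Toy one-step trajectories collected under the behavior policy.
trajectories = [
    [(0, 0, 1.0)],
    [(2, 1, 0.0)],
    [(1, 0, 2.0)],
    [(3, 1, 4.0)],
]

value = is_estimate(trajectories, phi, target, behavior)
print(value)  # 2.0
```

Because the ratios are looked up by abstract state, the policy tables need one row per abstract state rather than one per raw state, which is the kind of cardinality reduction the abstract refers to. Whether such an estimator remains valid depends on the irrelevance conditions the abstraction satisfies.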
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
- The irrelevance conditions proposed in [1] are adapted well to impose irrelevance of components of OPE methods (like the IS and MIS ratios).
- The final result, that OPE methods can be applied directly to the abstract MDP obtained after the two-step abstraction learning process, is a useful result. Effective two-step abstraction learning methods can make OPE practically applicable for problems with high dimensional states.
---
[1] Li, Lihong, Thomas J. Walsh, and Michael L. Littman. "Towards
- Fisher consistency: The statements of the results claim to show Fisher consistency of the corresponding abstract-state estimators; however, the proofs establish unbiasedness and identifiability. It is unclear whether these two conditions imply Fisher consistency.
- Experiments:
  - [L511] states that the reported metrics are MSE and absolute bias; however, Figure 5 reports relative MSE and relative absolute bias. The latter has not been defined.
  - The abstraction learning itself requires some a
1. The authors propose the novel idea of backwards model irrelevance.
2. The paper is well organized, and provides a lot of background and explanations.
1. Despite the authors' attempts at making things clearer, I didn't fully understand the notion of backwards model irrelevance, and why it would generally require more than one iteration with the forward model irrelevance. I think the paper would benefit from making this point clearer somehow. I found the examples on page 9 more confusing than helpful.
2. The example in the experiments looks very artificial. While it does show improved performance for the authors' method, it being tailored to t
- The paper nicely unifies several ideas, such as referencing prior work on model-irrelevant and pi-irrelevant abstractions and relating its backward model to prior work that learns an inverse dynamics model.
- The results are for a wide class of OPE estimators, compared to prior work that discusses OPE and abstractions only for a particular OPE estimator.
- The work tackles a problem that has received relatively little attention.
- The motivation of the paper is unclear. There is a possibly interesting idea here in the backward model, but it is unclear what the purpose of this model is. For example, Lemma 1 suggests that the previous irrelevance conditions are sufficient for consistent OPE, so it is unclear to me why we need an alternative condition to also get consistency. From Theorem 1, I see that the backward model also gives us two additional irrelevance conditions, but why do we need these if all we care about is c
The paper applies state abstraction, which has been widely studied for policy learning. This approach has the potential to significantly reduce state space cardinality and improve the accuracy of OPE estimators. The proposed iterative procedure for generating deeply abstracted states is a creative solution. It simplifies the sample complexity of OPE by shrinking the state space without compromising the representational capacity for policy evaluation. The method supports multiple types of OPE est
The iterative nature of the abstraction process, while powerful, might introduce computational overhead that is not fully addressed. Although the paper validates the approach with numerical experiments, the experimental section seems underdeveloped. There is a lack of detailed comparisons with baseline methods, especially in real-world scenarios. It would be helpful to see more empirical results across diverse datasets to better understand the practical performance of deeply abstracted states.
Taxonomy
Topics: Evaluation and Performance Assessment · International Development and Aid
Methods: Sparse Evolutionary Training
