Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies

Koichi Tanaka; Kazuki Kawamura; Takanori Muroi; Yusuke Narita; Yuki Sasamoto; Kei Tateno; Takuma Udagawa; Wei-Wei Du; Yuta Saito

arXiv:2603.21485 · cs.LG · March 24, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Click-based Inverse Propensity Score (CIPS), a novel off-policy evaluation method that leverages user click behavior to accurately estimate ranking policy performance even with deterministic logging policies.

Contribution

The paper proposes CIPS, a new estimator that uses click probabilities to enable low-bias off-policy evaluation under deterministic logging policies, overcoming limitations of existing methods.

Findings

1. CIPS achieves significantly lower bias than baseline estimators.
2. Theoretical analysis confirms favorable bias and variance properties.
3. Experimental results demonstrate effectiveness in real-world and synthetic data.

Abstract

Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is fully deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS), that exploit the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where…
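The failure mode the abstract describes can be illustrated with a minimal sketch. It is not the paper's CIPS estimator, whose definition is not given here. It is the vanilla IPS estimator in a single-position setting (a ranking of length one), with hypothetical reward rates and policies chosen only for illustration. Under a stochastic logging policy the estimate lands near the true value. Under a deterministic one the items that were never logged contribute nothing, so the estimate is badly biased.

```python
import numpy as np

def ips_estimate(actions, rewards, pi_target, pi_logging):
    """Vanilla IPS: average of w * r with w = pi_target(a) / pi_logging(a)."""
    weights = pi_target[actions] / pi_logging[actions]
    return float(np.mean(weights * rewards))

def simulate(pi_logging, expected_reward, n, rng):
    """Log n (action, reward) pairs under the given logging policy."""
    actions = rng.choice(len(pi_logging), size=n, p=pi_logging)
    rewards = rng.binomial(1, expected_reward[actions])
    return actions, rewards

rng = np.random.default_rng(0)

# Hypothetical setup: three items, Bernoulli rewards (all values are assumptions).
expected_reward = np.array([0.1, 0.5, 0.9])
pi_target = np.array([0.2, 0.3, 0.5])
true_value = float(pi_target @ expected_reward)  # 0.02 + 0.15 + 0.45 = 0.62

# Stochastic logging: every item has positive probability, so IPS is unbiased.
pi_stochastic = np.array([0.4, 0.3, 0.3])
a, r = simulate(pi_stochastic, expected_reward, 100_000, rng)
est_stochastic = ips_estimate(a, r, pi_target, pi_stochastic)

# Deterministic logging: only item 0 is ever shown, so items 1 and 2
# never appear in the log and their share of the value is lost.
# In expectation the estimate is 0.2 * 0.1 = 0.02, far below 0.62.
pi_deterministic = np.array([1.0, 0.0, 0.0])
a, r = simulate(pi_deterministic, expected_reward, 100_000, rng)
est_deterministic = ips_estimate(a, r, pi_target, pi_deterministic)

print(f"true value:            {true_value:.3f}")
print(f"IPS, stochastic log:   {est_stochastic:.3f}")
print(f"IPS, deterministic log:{est_deterministic:.3f}")
```

The gap in the deterministic case does not shrink with more data, because the missing support is a property of the logging policy rather than of the sample size. This is the gap that, according to the abstract, CIPS aims to close by weighting on click probabilities instead.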

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- This paper has considered a very practical problem: how to do off-policy evaluation for ranking policies when the data collection (logging) policy is deterministic (or not sufficiently stochastic). This is a practical and important problem in algorithmic ranking systems, such as recommendation systems.
- The proposed algorithm, CIPS, is natural and simple.

Overall, I think this is an interesting paper, and recommend accepting it.

Weaknesses

- I have concerns about some key math notations in this paper. In particular, notations like $C(a)$, $R(a)$, $C(k)$, and $R(k)$ can be misleading. Specifically, $C(k)$ hints that the click event **only** depends on the position $k$, while $C(a)$ hints that the click event **only** depends on the action $a$. I do not think this is what this paper has assumed. I recommend that the authors clean up such notations.
- The idea of this paper is very interesting. However, given the idea and the CIPS a

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* The paper's primary strength is proposing an approach to address the Off-Policy Evaluation (OPE) issue when the data-logging policy is fully deterministic, a setting where existing methods fail.
* The CIPS estimator method interestingly shifts the source of randomness away from the deterministic policy and instead exploits the intrinsic stochasticity of user click behavior for importance weighting. This approach is shown to "significantly" reduce the severe bias of standard estimators in this

Weaknesses

The following items would strengthen the paper if developed, explained, or addressed further:

* The theoretical unbiasedness of CIPS depends on Condition 3.2, which assumes that the expected potential reward for an item (e.g., a purchase after a click) depends only on that item, not on the other items shown in the ranking. The paper's own experiments show that as this condition is more strongly violated, the Mean Squared Error (MSE) of CIPS, while still the best, does inc

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper is motivated by a well-founded research question: existing OPE methods are primarily developed for stochastic logging policies, which inherently limits their reliability under deterministic settings. This motivation is both sound and clearly articulated.
2. To alleviate the bias commonly induced by the OPE methods under deterministic logging, the paper relaxes the stochasticity assumption of existing OPE methods and takes a novel perspective by leveraging the inherent randomness in

Weaknesses

1. The CIPS method implicitly assumes consistency in the action space between the old and new policies. It remains unclear whether CIPS can maintain its robustness when new actions appear under the new policy.
2. In the synthetic data section, the exact forms of functions appearing in Equations (9) and (10) are not clearly specified.
3. The paper lacks a dedicated related work section. The relation between this work and related work should be discussed.
4. Baselines are limited. More recent bas

Taxonomy

Topics: Game Theory and Voting Systems · Information Retrieval and Search Behavior · Recommender Systems and Techniques