A General Framework for Off-Policy Learning with Partially-Observed Reward
Rikiya Takehi, Masahiro Asami, Kosuke Kawakami, Yuta Saito

TL;DR
This paper introduces HyPeR, a novel framework for off-policy learning that effectively leverages secondary rewards to improve policy optimization when target rewards are only partially observed, with strong theoretical and empirical support.
Contribution
The work proposes a new method, HyPeR, that utilizes secondary rewards to enhance off-policy learning under partial target reward observation, and provides theoretical analysis and empirical validation.
Findings
HyPeR outperforms existing methods in synthetic data scenarios.
Leveraging secondary rewards improves target reward optimization.
Jointly optimizing target and secondary rewards can be beneficial.
Abstract
Off-policy learning (OPL) in contextual bandits aims to learn a decision-making policy that maximizes the target rewards by using only historical interaction data collected under previously developed policies. Unfortunately, when rewards are only partially observed, the effectiveness of OPL degrades severely. Well-known examples of such partial rewards include explicit ratings in content recommendations, conversion signals on e-commerce platforms that are partial due to delay, and the issue of censoring in medical problems. One possible solution to deal with such partial rewards is to use secondary rewards, such as dwelling time, clicks, and medical indicators, which are more densely observed. However, relying solely on such secondary rewards can also lead to poor policy learning since they may not align with the target reward. Thus, this work studies a new and general problem of OPL…
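To make the partial-reward setting concrete, here is a minimal sketch of inverse propensity scoring (IPS) when the target reward is seen only for a fraction of logged interactions. It is not the HyPeR estimator and uses no secondary rewards. It assumes a context-free bandit, an observation indicator that is independent of the reward with a known probability `p_obs`, and hypothetical names throughout (`ips_partial_reward`, `q`, `pi_logging`, `pi_target`).

```python
import numpy as np


def ips_partial_reward(pi_target, pi_logging, actions, rewards, observed, p_obs):
    """IPS estimate of a policy's value when target rewards are partially observed.

    Assumption (not from the paper): each reward is observed independently
    with known probability p_obs.
    """
    idx = np.arange(len(actions))
    # Importance weight of each logged action under the target policy.
    w = pi_target[idx, actions] / pi_logging[idx, actions]
    # Missing rewards contribute zero; dividing by p_obs corrects for missingness.
    r = np.where(observed == 1, rewards, 0.0)
    return float(np.mean(w * r / p_obs))


rng = np.random.default_rng(0)
n, n_actions = 200_000, 3
q = np.array([0.2, 0.5, 0.8])              # hypothetical expected reward per action
logging_probs = np.array([0.5, 0.3, 0.2])  # hypothetical logging policy
target_probs = np.array([0.1, 0.2, 0.7])   # hypothetical policy to evaluate
pi_logging = np.tile(logging_probs, (n, 1))
pi_target = np.tile(target_probs, (n, 1))

actions = rng.choice(n_actions, size=n, p=logging_probs)
rewards = rng.binomial(1, q[actions]).astype(float)
p_obs = 0.3
observed = rng.binomial(1, p_obs, size=n)
rewards[observed == 0] = np.nan            # unobserved target rewards are missing

estimate = ips_partial_reward(pi_target, pi_logging, actions, rewards, observed, p_obs)
# Naive variant that treats missing rewards as zero without any correction.
naive = ips_partial_reward(pi_target, pi_logging, actions, rewards, observed, 1.0)
true_value = float(target_probs @ q)

print(f"true value:         {true_value:.3f}")
print(f"corrected estimate: {estimate:.3f}")
print(f"naive estimate:     {naive:.3f}")
```

In this sketch the corrected estimate stays centred on the true value, but only about `p_obs` of the samples carry any reward signal, so its variance grows as observations become sparser. That loss of signal is the motivation the abstract gives for bringing in more densely observed secondary rewards.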
Peer Reviews
Decision: ICLR 2025 Poster
The paper has the following strengths: + The paper provides an unbiased estimator for the value based on secondary information and proves that the expected gradient is the same as the one without considering secondary information ($\beta=0$). + The authors formally prove the variance reduction due to the introduction of the secondary information and show that the reduction exists if the policy learned from the secondary information is better than the policy based only on the partially observed reward.
The paper can be improved in the following aspects. - It is common to consider a weighted combination of the original and secondary information in machine learning algorithms. Similar strategies are used in few-shot learning [2], informed learning [1], etc. The authors need to compare the proposed algorithm with them and emphasize their uniqueness. - The paper considers the setting with some types of reinforcement learning settings and secondary information, and the authors only consider polic
- The paper is well written and is easy to follow. - The problem of partially observed/delayed reward is of high importance. - The formulation of the data-generation process is simple and can model diverse problems.
- The contribution of the paper is twofold: first, the introduction of a framework for partially-observed rewards in the presence of secondary rewards, which comes with a new data-generation process (DGP); and second, the derivation of an unbiased policy gradient that can efficiently leverage all information to optimise the target. These two main contributions present some weaknesses: + The framework's data generation process (DGP) is **restrictive and was not properly tested**: If the DGP mo
Incorporating auxiliary variables for variance reduction is a well-established idea in the ML community. However, the authors innovatively integrate this concept with the doubly robust estimator to tackle the challenge of limited data coverage under partially observable rewards—a contribution that appears novel. The experiments further support the theoretical insights, and the paper is generally clear and easy to follow, aside from a few errors and typos.
I noticed several potential technical issues (or possibly typos) that may impact the rigor of the results: 1. In eqn (6), you write down the distribution in a way that allows $r_i$ to also depend on $o_i$. This will cause the problem of confounding bias, which does not give rise to an unbiased estimator of $q(x, a)$. See Kato et al. 2021 or Pearl, 2009 for details. Is this a typo or is there something missing? I found in the proof of Theorem 1 that you indeed assume $r$ to be independent of $o$
Taxonomy
Topics: Auction Theory and Applications · Economic Policies and Impacts · Supply Chain and Inventory Management
Methods: ALIGN
