Semi-supervised Batch Learning From Logged Data

Gholamali Aminian; Armin Behnamnia; Roberto Vega; Laura Toni; Chengchun Shi; Hamid R. Rabiee; Omar Rivasplata; Miguel R. D. Rodrigues

arXiv:2209.07148·cs.LG·February 20, 2024

Open Access · 4 Reviews

TL;DR

This paper introduces a semi-supervised batch learning approach from logged data with missing feedback, leveraging a novel risk bound to improve policy learning performance in off-policy settings.

Contribution

It proposes a new regularized semi-supervised learning method that effectively utilizes missing-feedback data using a novel risk bound within the counterfactual risk minimization framework.

Findings

01 Improved policy performance over logging policies.

02 Effective use of missing-feedback data in learning.

03 Validated on benchmark datasets.

Abstract

Off-policy learning methods are intended to learn a policy from logged data, which includes context, action, and feedback (cost or reward) for each sample point. In this work, we build on the counterfactual risk minimization framework, which also assumes access to propensity scores. We propose learning methods for problems where feedback is missing for some samples, so the logged data contains both samples with feedback and samples with missing feedback. We refer to this type of learning as semi-supervised batch learning from logged data, which arises in a wide range of application domains. We derive a novel upper bound for the true risk under the inverse propensity score estimator to address this kind of learning problem. Using this bound, we propose a regularized semi-supervised batch learning method with logged data where the regularization term is feedback-independent and, as a result,…
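
The objective described in the abstract can be illustrated with a minimal sketch: an inverse propensity score (IPS) risk estimate computed only on samples with feedback, plus a feedback-independent regularizer computed on all samples. This is not the authors' implementation. The choice of KL divergence (mentioned in the reviews below), its direction, the weight `lam`, and every function and variable name are assumptions made for illustration.

```python
import numpy as np


def ips_risk(costs, target_probs, logging_probs):
    """IPS estimate of the risk from samples whose feedback was logged."""
    weights = target_probs / logging_probs  # importance weights
    return float(np.mean(costs * weights))


def kl_regularizer(target_dist, logging_dist):
    """Mean KL(target || logging) over contexts; uses no feedback at all.

    Both inputs have shape (n_samples, n_actions) and are assumed to be
    strictly positive so the logarithm is finite.
    """
    log_ratio = np.log(target_dist / logging_dist)
    return float(np.mean(np.sum(target_dist * log_ratio, axis=1)))


def semi_supervised_objective(costs, actions, has_feedback,
                              target_dist, logging_dist, lam=0.1):
    """IPS risk on labeled samples plus a regularizer on all samples.

    `lam` is a hypothetical regularization weight, not a value from the paper.
    Entries of `costs` where `has_feedback` is False are ignored.
    """
    rows = np.arange(len(actions))
    target_probs = target_dist[rows, actions]    # pi(a_i | x_i)
    logging_probs = logging_dist[rows, actions]  # propensity scores
    risk = ips_risk(costs[has_feedback],
                    target_probs[has_feedback],
                    logging_probs[has_feedback])
    reg = kl_regularizer(target_dist, logging_dist)
    return risk + lam * reg
```

The point of the split is that samples with missing feedback still contribute to the regularizer, because it depends only on the contexts and the two policies. When the target policy equals the logging policy, the regularizer vanishes and the objective reduces to the average logged cost over the labeled samples.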

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1. The idea is straightforward and well-motivated.
2. Theorem 1 bounds the IPS reward on all samples by the known-reward samples, which builds the foundation of the proposed algorithm.
3. The experimental results for the neural network policy are good, and the improvement over the baseline is huge.

Weaknesses

1. The writing is not clear enough.
2. The proposed method lacks novelty; it is a policy-constraint method.
3. The paper does not explain the advantages of the method when dealing with samples that have missing rewards.
4. The introduction only covers the background and does not include the motivation for the method. Also, the logic of the introduction and the related work is not good enough. For example, in Section 5, the pseudo-labeling algorithm part is confusing, which makes me belie

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

Batch off-policy learning is a common and important problem in practice. The authors identify a particularly interesting setting where rewards are missing from the responses. The use of the importance sampling weights is well motivated by noting the relationship between the importance weights and the risk of the respective policies. The use of pseudo labels is also an interesting approach.

Weaknesses

My main issue with this work is that the presentation makes it difficult to parse the contribution of the work. The authors place great emphasis on their result relating the risks to the KL divergence (and other Bregman divergences), but then also introduce methodology without a lot of details. It's also not entirely clear to me which assumptions are being employed here in order to make the method applicable. The authors note that access to the true logging policy, however it would a

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- This paper is well-written and easy to follow. The theoretical definitions and most of the related works are clearly explained.
- The approach and the derived upper bound for the true risk are original.
- The experiments are enough to demonstrate their approach.

Weaknesses

- Although the derived upper bound is new, the way the learning algorithm leverages this bound is quite straightforward. In addition, the practical algorithm lacks a theoretical guarantee (i.e., an upper bound for the regret).
- The experiments only include two datasets. It would be better to add a real-life example in the bandit setting.

Reviewer 04 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. This paper contributes to a significant research question, i.e., policy learning from logged data. It contributes to the existing literature by introducing a KL-regularization framework, and proposing to estimate KL with unlabeled data (while still being powerful when KL is in-sample estimated).
2. Originality. The idea of regularization is not new (in fact there is some missing literature I'll discuss in the Weaknesses section), but the method presented here contains new techniques and con

Weaknesses

1. Related literature. In general this paper does a good job of relating to existing literature, but the idea of penalizing with KL divergence is closely related to the literature on "Pessimism" for policy learning, e.g., [1] in offline RL, where value estimation is penalized with uncertainty estimation (related to how close a target policy is to the behavior policy), or [2] which is closer to the batched policy learning problem. The point is that by penalization and optimizing a lower confide

Taxonomy

Topics: Reservoir Engineering and Simulation Methods