Offline Policy Learning with Weight Clipping and Heaviside Composite Optimization
Jingren Liu, Hanzhang Qin, Junyi Liu, Mabel C. Chou, Jong-Shi Pang

TL;DR
This paper introduces an offline policy learning method that uses weight clipping to reduce the variance of policy value estimates, reformulates the resulting optimization as a Heaviside composite problem, and solves it efficiently with integer programming.
Contribution
It develops a weight-clipping estimator for offline policy learning, reformulates the resulting policy optimization as a Heaviside composite problem, and provides theoretical bounds on the suboptimality of the learned policy.
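For concreteness, here is a generic form of such an objective for two actions and a linear policy. It assumes an inverse-propensity base estimator; the paper also mentions doubly robust estimators, so its exact estimator and integer program may differ. The notation is introduced for illustration: logged data $(X_i, A_i, Y_i)$ with $A_i \in \{0, 1\}$, logging propensity $e(A_i \mid X_i)$, clipping threshold $\tau$, and $H$ the Heaviside step. The big-M constraints are a standard textbook encoding of $z_i = H(\beta^\top X_i)$, assuming $|\beta^\top X_i| \le M$ and a small margin $\varepsilon > 0$.

```latex
\pi_\beta(x) = H(\beta^\top x), \qquad
H(t) = \begin{cases} 1 & t \ge 0 \\ 0 & t < 0 \end{cases}

% clipped weights: small propensities are truncated at tau
w_i(\tau) = \frac{1}{\max\{ e(A_i \mid X_i), \tau \}}

% Heaviside composite form of the clipped value estimate
\hat V_\tau(\beta) = \frac{1}{n} \sum_{i=1}^{n} w_i(\tau)\, Y_i
  \Big[ A_i\, H(\beta^\top X_i) + (1 - A_i)\big(1 - H(\beta^\top X_i)\big) \Big]

% fixed tau: replace H(\beta^\top X_i) by binary z_i with big-M constraints
\max_{\beta,\, z} \; \frac{1}{n} \sum_{i=1}^{n} w_i(\tau)\, Y_i
  \big[ A_i z_i + (1 - A_i)(1 - z_i) \big]
\quad \text{s.t.} \quad
-M (1 - z_i) \le \beta^\top X_i \le (M + \varepsilon) z_i - \varepsilon,
\quad z_i \in \{0, 1\}
```

With $z_i = 1$ the constraints force $0 \le \beta^\top X_i \le M$, and with $z_i = 0$ they force $-M \le \beta^\top X_i \le -\varepsilon$, so for a fixed $\tau$ the objective is linear in $z$ and the problem is a mixed-integer linear program. This display holds $\tau$ fixed; when the threshold is itself chosen to minimize MSE for each policy, the two levels are coupled, which is the bilevel structure the abstract refers to.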
Findings
Weight clipping reduces variance in policy value estimation.
Reformulating the problem as Heaviside composite optimization makes the policy optimization efficiently solvable.
The proposed method comes with theoretical guarantees (suboptimality bounds) on policy learning performance.
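A small Monte Carlo experiment illustrates the first finding. The setup is hypothetical and not taken from the paper: synthetic data with one covariate, a fixed clipping threshold `tau = 0.2`, an illustrative helper named `simulate_estimates`, and plain inverse propensity weighting (IPW) as the unclipped baseline.

```python
import random
import statistics


def simulate_estimates(n_samples: int, n_reps: int, tau: float, seed: int = 0):
    """Return (ipw, clipped): one value estimate per synthetic dataset.

    Hypothetical setup (not from the paper): covariate x ~ Uniform(0, 1),
    binary logged action with propensity P(A = 1 | x) = 0.02 + 0.96 * x,
    reward Y = 1 + x + N(0, 0.1^2), and a target policy that always picks
    action 1, so its true value is E[1 + x] = 1.5.
    """
    rng = random.Random(seed)
    ipw, clipped = [], []
    for _ in range(n_reps):
        total_ipw = total_clipped = 0.0
        for _ in range(n_samples):
            x = rng.random()
            p1 = 0.02 + 0.96 * x  # propensity can be as small as 0.02
            a = 1 if rng.random() < p1 else 0
            y = 1.0 + x + rng.gauss(0.0, 0.1)
            if a == 1:  # the logged action agrees with the target policy
                total_ipw += y / p1                # weight 1/p1, up to 50
                total_clipped += y / max(p1, tau)  # weight capped at 1/tau
        ipw.append(total_ipw / n_samples)
        clipped.append(total_clipped / n_samples)
    return ipw, clipped


ipw, clipped = simulate_estimates(n_samples=200, n_reps=200, tau=0.2)
for name, est in (("IPW", ipw), ("clipped", clipped)):
    mean = statistics.mean(est)
    var = statistics.variance(est)
    print(f"{name:8s} mean={mean:.3f} var={var:.4f} mse={(mean - 1.5) ** 2 + var:.4f}")
```

Because rewards in this setup are positive, clipping can only lower each estimate, so it trades a downward bias for a smaller variance. Whether the MSE actually improves depends on `tau`, which is why the threshold is chosen to minimize MSE rather than fixed in advance.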
Abstract
Offline policy learning aims to use historical data to learn an optimal personalized decision rule. In the standard estimate-then-optimize framework, reweighting-based methods (e.g., inverse propensity weighting or doubly robust estimators) are widely used to produce unbiased estimates of policy values. However, when the propensity scores of some treatments are small, these reweighting-based methods suffer from high variance in policy value estimation, which may mislead the downstream policy optimization and yield a learned policy with inferior value. In this paper, we systematically develop an offline policy learning algorithm based on a weight-clipping estimator that truncates small propensity scores via a clipping threshold chosen to minimize the mean squared error (MSE) in policy value estimation. Focusing on linear policies, we address the bilevel and discontinuous objective…
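The pipeline described in the abstract (truncate small propensities, choose the threshold by MSE, evaluate a linear policy) can be sketched for two actions as follows. This is a minimal sketch under stated assumptions: the plug-in MSE proxy (squared gap to the unclipped IPW estimate plus sample variance over n), the threshold grid, and the function names `heaviside`, `clipped_terms`, `estimated_mse`, and `select_tau` are all hypothetical and are not the paper's estimator. It evaluates a fixed `beta` rather than solving the policy optimization.

```python
import random
import statistics


def heaviside(t: float) -> int:
    """Heaviside step H(t): 1 if t >= 0, else 0 (convention at t = 0 assumed)."""
    return 1 if t >= 0.0 else 0


def clipped_terms(beta, xs, actions, rewards, props, tau):
    """Per-sample terms of a clipped IPW estimate for pi_beta(x) = H(beta . x).

    Actions are in {0, 1}; props[i] is the logging propensity of the action
    actually taken, e(A_i | X_i) > 0. Setting tau = 0 recovers plain IPW.
    """
    terms = []
    for x, a, y, p in zip(xs, actions, rewards, props):
        chosen = heaviside(sum(b * xj for b, xj in zip(beta, x)))
        terms.append(y / max(p, tau) if a == chosen else 0.0)
    return terms


def estimated_mse(beta, xs, actions, rewards, props, tau):
    """Hypothetical plug-in MSE: (clipped - unclipped mean)^2 + sample variance / n."""
    clipped = clipped_terms(beta, xs, actions, rewards, props, tau)
    unclipped = clipped_terms(beta, xs, actions, rewards, props, 0.0)
    bias_hat = statistics.fmean(clipped) - statistics.fmean(unclipped)
    return bias_hat ** 2 + statistics.variance(clipped) / len(clipped)


def select_tau(beta, xs, actions, rewards, props, grid):
    """Pick the threshold on `grid` with the smallest estimated MSE for this beta."""
    return min(grid, key=lambda tau: estimated_mse(beta, xs, actions, rewards, props, tau))


# Usage on synthetic logged data (intercept plus one feature).
rng = random.Random(1)
xs, actions, rewards, props = [], [], [], []
for _ in range(500):
    x = (1.0, rng.uniform(-1.0, 1.0))
    p1 = 0.05 + 0.9 * (x[1] + 1.0) / 2.0  # P(A = 1 | x), between 0.05 and 0.95
    a = 1 if rng.random() < p1 else 0
    xs.append(x)
    actions.append(a)
    rewards.append(1.0 + a * x[1] + rng.gauss(0.0, 0.1))
    props.append(p1 if a == 1 else 1.0 - p1)

beta = (0.0, 1.0)  # treat when the feature is nonnegative
tau = select_tau(beta, xs, actions, rewards, props, grid=[0.0, 0.05, 0.1, 0.2, 0.3])
value = statistics.fmean(clipped_terms(beta, xs, actions, rewards, props, tau))
print(tau, value)
```

Because `heaviside` makes the estimate piecewise constant in `beta`, gradient methods do not apply directly, and because the selected `tau` depends on `beta`, threshold selection and policy optimization are coupled. These are the discontinuous and bilevel features the abstract mentions.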
Taxonomy
Topics: Advanced Causal Inference Techniques · Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics
