Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach
Wen Huang, Xintao Wu

TL;DR
This paper introduces a causal framework to improve bandit algorithms using biased offline data, effectively addressing confounding and selection biases to enhance decision-making and reduce regret.
Contribution
It formulates a causal approach to derive bounds that are robust to biases, guiding bandit algorithms to better utilize offline data for near-optimal policies.
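To make the idea of a bias-robust bound concrete, here is the classic "natural" bound for the confounding-only case. It is shown as an illustration under stated assumptions, not as the paper's own derivation: the bounds in the paper also account for selection bias and may differ. The assumptions are that rewards Y lie in [0, 1] and that P(a) is the probability that arm a was chosen in the offline data. Y_a denotes the potential reward under arm a.

```latex
% Assumptions: Y \in [0,1]; P(a) = P(A = a) in the offline (observational) data.
\begin{align*}
\mathbb{E}[Y \mid do(a)]
  &= P(a)\,\mathbb{E}[Y \mid A = a]
   + \sum_{a' \neq a} P(a')\,\mathbb{E}[Y_a \mid A = a'] \\
\Rightarrow\quad
P(a)\,\mathbb{E}[Y \mid A = a]
  &\le \mathbb{E}[Y \mid do(a)]
  \le P(a)\,\mathbb{E}[Y \mid A = a] + 1 - P(a).
\end{align*}
```

The counterfactual term in the first line cannot be identified from confounded data, but it lies between 0 and 1 - P(a), which gives an interval that always contains the true mean reward of arm a.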
Findings
Derived causal bounds effectively guide policy learning.
Incorporating bounds reduces asymptotic regret.
Framework applicable to both contextual and non-contextual bandits.
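One way such bounds could guide a bandit agent is sketched below, assuming a non-contextual Bernoulli bandit and a UCB1-style index. The function name clipped_ucb, the pruning rule, and the clipping of the index to each arm's interval are illustrative assumptions, not the algorithm published by the authors. Arms whose upper bound falls below the largest lower bound are dropped before any online pull, and the remaining indices are clipped to their causal intervals.

```python
import math
import random


def clipped_ucb(true_means, bounds, horizon, seed=0):
    """UCB1 with per-arm causal bounds (lo, hi) on the mean reward.

    Hypothetical sketch: true_means drives a simulated Bernoulli
    environment, bounds[a] = (lo, hi) is assumed to contain the true
    mean of arm a. Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k
    sums = [0.0] * k

    # An arm whose upper bound is below the best lower bound cannot be optimal.
    best_lower = max(lo for lo, _ in bounds)
    active = [a for a in range(k) if bounds[a][1] >= best_lower]

    def index(a, t):
        # Standard UCB1 index, then clipped to the arm's causal interval.
        mean = sums[a] / pulls[a]
        ucb = mean + math.sqrt(2.0 * math.log(t) / pulls[a])
        lo, hi = bounds[a]
        return min(max(ucb, lo), hi)

    for t in range(1, horizon + 1):
        untried = [a for a in active if pulls[a] == 0]
        if untried:
            arm = untried[0]  # pull each surviving arm once first
        else:
            arm = max(active, key=lambda a: index(a, t))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls
```

With bounds [(0.1, 0.3), (0.35, 0.8), (0.4, 0.9)], the first arm is pruned because its upper bound 0.3 is below the best lower bound 0.4, so it is never pulled online. This is the intuition behind the regret reduction: informative bounds shrink the set of arms that must be explored.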
Abstract
This paper studies bandit problems where an agent has access to offline data that can be used to improve the estimation of each arm's reward distribution. A major obstacle in this setting is the existence of compound biases in the observational data. Ignoring these biases and blindly fitting a model to the biased data could even harm the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract from the biased observational data a causal bound for each arm that is robust to compound biases. The derived bounds contain the ground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and…
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Cognitive Radio Networks and Spectrum Sensing
