Contextual Bandits with Latent Confounders: An NMF Approach
Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G. Dimakis, and Sanjay Shakkottai

TL;DR
This paper introduces a non-negative matrix factorization (NMF) approach to causal stochastic contextual bandits with latent confounders, achieving improved regret bounds by exploiting low-dimensional structure in the reward matrix.
Contribution
It proposes a novel NMF-based algorithm for contextual bandits with latent confounders, providing the first regret guarantees for online matrix completion with bandit feedback when the rank exceeds one.
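Bandit feedback reveals rewards only for the (context, arm) pairs that have actually been played, so an NMF step in this setting amounts to fitting non-negative factors to a partially observed matrix. The sketch below illustrates that building block with weighted Lee-Seung multiplicative updates, a standard NMF technique swapped in purely for illustration; it is not the paper's algorithm, and the names `masked_nmf` and `observed_loss` and the toy data are hypothetical.

```python
import random


def observed_loss(R, mask, P):
    """Squared error between R and P, summed over observed entries (mask == 1) only."""
    return sum(
        mask[i][j] * (R[i][j] - P[i][j]) ** 2
        for i in range(len(R))
        for j in range(len(R[0]))
    )


def masked_nmf(R, mask, rank, iters=500, seed=0, eps=1e-9):
    """Hypothetical helper: fit R ~ A @ W with A, W >= 0 using only entries where mask == 1.

    Weighted Lee-Seung multiplicative updates; multiplying by a ratio of
    non-negative quantities keeps both factors non-negative.
    """
    rng = random.Random(seed)
    n, k = len(R), len(R[0])
    A = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(rank)]

    def predict():
        return [[sum(A[i][r] * W[r][j] for r in range(rank)) for j in range(k)]
                for i in range(n)]

    for _ in range(iters):
        P = predict()
        # A <- A * ((mask * R) W^T) / ((mask * P) W^T), elementwise
        for i in range(n):
            for r in range(rank):
                num = sum(mask[i][j] * R[i][j] * W[r][j] for j in range(k))
                den = sum(mask[i][j] * P[i][j] * W[r][j] for j in range(k)) + eps
                A[i][r] *= num / den
        P = predict()
        # W <- W * (A^T (mask * R)) / (A^T (mask * P)), elementwise
        for r in range(rank):
            for j in range(k):
                num = sum(A[i][r] * mask[i][j] * R[i][j] for i in range(n))
                den = sum(A[i][r] * mask[i][j] * P[i][j] for i in range(n)) + eps
                W[r][j] *= num / den
    return A, W, predict()


# Usage: hypothetical rank-2 reward matrix (6 contexts x 5 arms) with 3 unplayed pairs.
A_true = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.2], [0.3, 1.0], [0.8, 0.8]]
W_true = [[0.9, 0.1, 0.5, 0.2, 0.7], [0.1, 0.8, 0.3, 0.9, 0.2]]
R = [[sum(A_true[i][r] * W_true[r][j] for r in range(2)) for j in range(5)]
     for i in range(6)]
mask = [[1] * 5 for _ in range(6)]
mask[0][1] = mask[2][4] = mask[4][0] = 0  # values of R at these entries are ignored
A, W, P = masked_nmf(R, mask, rank=2)
```

Entries of `P` where `mask` is 0 serve as estimates for (context, arm) pairs that have not been played yet; how close they come to the true means depends on how many entries are observed and on how well conditioned the factors are.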
Findings
Achieves a regret of $\tilde{O}(L\,\text{poly}(m, \log K)\log T)$
Outperforms conventional bandit algorithms in experiments
Provides theoretical lower bounds matching upper bounds under mild conditions
Abstract
Motivated by online recommendation and advertising systems, we consider a causal model for stochastic contextual bandits with a latent low-dimensional confounder. In our model, there are $L$ observed contexts and $K$ arms of the bandit. The observed context influences the reward obtained through a latent confounder variable with cardinality $m$ ($m \ll L, K$). The arm choice and the latent confounder causally determine the reward, while the observed context is correlated with the confounder. Under this model, the $L \times K$ mean reward matrix (for each context in $[L]$ and each arm in $[K]$) factorizes into non-negative factors of size $L \times m$ and $m \times K$. This insight enables us to propose an $\epsilon$-greedy NMF-Bandit algorithm that designs a sequence of interventions (selecting specific arms) that achieves a balance between learning…
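Because the context affects the reward only through the latent confounder $z$, the mean reward when arm $k$ is selected in context $l$ decomposes as $\mathbb{E}[r \mid l, k] = \sum_{z=1}^{m} P(z \mid l)\,\mathbb{E}[r \mid z, k]$, which is a product of a non-negative $L \times m$ matrix and a non-negative $m \times K$ matrix. The sketch below checks this on a hypothetical toy instance, assuming Bernoulli rewards; the names `A`, `W`, `U`, `mean_reward_matrix`, and `sample_reward` are illustrative and not taken from the paper.

```python
import random


def mean_reward_matrix(A, W):
    """U[l][k] = sum_z A[l][z] * W[z][k], i.e. E[reward | context l, arm k]."""
    m, K = len(W), len(W[0])
    return [[sum(A[l][z] * W[z][k] for z in range(m)) for k in range(K)]
            for l in range(len(A))]


def sample_reward(A, W, l, k, rng):
    """Simulate one round: context l -> latent z -> Bernoulli reward for arm k."""
    z = rng.choices(range(len(W)), weights=A[l])[0]  # confounder drawn from P(z | l)
    return 1.0 if rng.random() < W[z][k] else 0.0     # reward depends only on (z, k)


# Hypothetical toy instance (assumed Bernoulli rewards): L = 4, m = 2, K = 3.
A = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.3], [0.2, 0.8]]  # row l is P(z | context l)
W = [[0.9, 0.2, 0.5], [0.1, 0.6, 0.4]]                # W[z][k] = P(reward = 1 | z, arm k)

U = mean_reward_matrix(A, W)
best_arm = [max(range(len(U[l])), key=lambda k: U[l][k]) for l in range(len(U))]
print(best_arm)  # [0, 1, 0, 1]
```

With $m \ll L, K$ the two factors hold $Lm + mK$ numbers instead of $LK$; this toy instance is too small to show the savings, but that reduction is the low-dimensional structure the algorithm exploits.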
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Reinforcement Learning in Robotics
