On Transportability for Structural Causal Bandits
Min Woo Park, Sanghack Lee

TL;DR
This paper explores how to transfer causal knowledge across different environments to improve decision-making in structural causal bandits, achieving better learning efficiency and regret bounds.
Contribution
It introduces a framework for transportability in structural causal bandits, leveraging invariances across environments to enhance learning from heterogeneous data.
Findings
The proposed algorithm achieves sub-linear regret with explicit dependence on prior data quality.
Transportability can outperform standard bandit methods relying only on online data.
Invariance exploitation across environments improves the efficiency of causal bandit learning.
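As a rough illustration of how prior bounds can help an online learner, the sketch below runs a UCB-style bandit whose indices are clipped to per-arm prior intervals. It is not the paper's algorithm: the function name `clipped_ucb`, the Bernoulli rewards, and the hand-supplied `bounds` are all assumptions made for this example, standing in for bounds that would be derived from source-environment data.

```python
import math
import random


def clipped_ucb(true_means, bounds, horizon, seed=0):
    """UCB1 on Bernoulli arms with each index clipped to a prior interval.

    Hypothetical helper for illustration; `bounds[a] = (lo, hi)` is assumed
    to contain the true mean of arm `a`.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)

    # An arm whose upper bound lies below the best lower bound cannot be optimal.
    best_lower = max(lo for lo, _ in bounds)
    active = [a for a in range(n_arms) if bounds[a][1] >= best_lower]

    counts = [0] * n_arms
    sums = [0.0] * n_arms

    def index(arm, t):
        if counts[arm] == 0:
            return math.inf  # try every surviving arm once
        lo, hi = bounds[arm]
        mean = sums[arm] / counts[arm]
        bonus = math.sqrt(2.0 * math.log(t) / counts[arm])
        return min(max(mean + bonus, lo), hi)  # clip to the prior interval

    for t in range(1, horizon + 1):
        arm = max(active, key=lambda a: index(a, t))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


counts = clipped_ucb(
    true_means=[0.2, 0.5, 0.8],
    bounds=[(0.0, 0.3), (0.4, 1.0), (0.0, 1.0)],
    horizon=500,
)
print(counts)  # arm 0 is pruned by its bounds, so counts[0] is 0
```

In this toy setup, tighter intervals remove arms before any online pull and cap the optimism of the remaining ones, which is one way prior data quality can plausibly show up in the regret.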
Abstract
Intelligent agents equipped with causal knowledge can optimize their action spaces to avoid unnecessary exploration. The structural causal bandit framework provides a graphical characterization for identifying actions that are unable to maximize rewards by leveraging prior knowledge of the underlying causal structure. While such knowledge enables an agent to estimate the expected rewards of certain actions based on others in online interactions, there has been little guidance on how to transfer information inferred from arbitrary combinations of datasets collected under different conditions -- observational or experimental -- and from heterogeneous environments. In this paper, we investigate the structural causal bandit with transportability, where priors from the source environments are fused to enhance learning in the deployment setting. We demonstrate that it is possible to exploit…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Introduction to the causality elements is complete and well-compressed, and I appreciate the difficulty in making such a causality-rich paper appealing to a general audience. The problem is interesting and applying bandit ideas to the transportability literature has a lot of potential.
First, it was difficult to tell from the text what the paper's contribution is; a lot is introduced, but it is unclear which ideas are novel and which are not. For example, the paper does not make clear that the definition of transportability in bandits is not novel. It has been studied before under the same notion of regret (e.g., Bellot et al., 2023). Second, it is hard to judge the size of the contribution. The paper combines existing ideas on existin
- The presentation is good, with intuitive examples.
- The experiment sections show that the proposed approach surpasses baselines, which have no information from other domains.
I find the paper somewhat incremental, and the contribution appears to be relatively limited in scope.
- The transfer learning algorithm design idea is the same as Zhang and Bareinboim (2017), where they use the observational distribution to bound the expected reward. The UCB bandit approach is also the same. In Zhang and Bareinboim (2017), the Thompson sampling approach is also applied; however, it is not utilized in this paper.
- To reduce the action space, the paper applies the POMIS approa
Please find the strengths below:
1. The paper integrates causal transportability theory with the structural causal bandit framework, enabling the use of information observed in other SCMs to help estimate the effects of interventions. This represents a creative and unconventional approach to improving online learning efficiency.
2. The theoretical analysis is rigorous: the paper formally derives the conditions for transferability and establishes a sub-linear regret bound that depends explicitly
Please find the weaknesses below:
1. The approach requires full knowledge of the causal and selection diagrams to determine transportability, which is often unrealistic in practice. The proposed TC-UCB mainly extends causal UCB by adding transportable priors, so its algorithmic novelty is limited.
2. Theoretical analysis does not specify worst-case dependence on the graph size or action space. When the number of interventions is large, the regret may scale exponentially, and there is no discussi
1. Hierarchical Dominance Relations in Action Spaces: The paper derives dominance bounds for expected rewards, such as $\mathbb{E}_{P^*_{x^\star}} Y \leq \mathbb{E}_{P^*_{r^\star}} Y$ where $r^\star$ is optimal under weaker constraints, allowing efficient pruning of non-optimal actions like non-POIS sets without full exploration.
2. Transportability for Causal Bounds: It introduces a method using c-factors $Q^*[C_q]$ to bound non-transportable expected rewards, e.g., $\ell = Q^i[C']$ and $u = Q^i[C'] + 1 - \sum_c Q^i[C
1. Lack of Regret Lower Bound: The paper provides an upper bound on cumulative regret but does not derive a corresponding lower bound, limiting assessment of the algorithm's theoretical optimality in the structural causal bandit setting.
2. Straightforward Combination of Frameworks: The core model integrates structural causal models (SCMs) with transportability via selection diagrams in a direct manner, where dominance relations largely extend POMIS definitions without introducing fundamentally
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Gaussian Processes and Bayesian Inference
