When Shift Happens - Confounding Is to Blame
Abbavaram Gowtham Reddy, Celia Rubio-Madrigal, Rebekka Burkholz, Krikamol Muandet

TL;DR
This paper investigates how hidden confounding affects distribution shifts in machine learning, revealing that models incorporating confounder proxies and environment-specific relationships can improve out-of-distribution robustness.
Contribution
It provides empirical and theoretical evidence that hidden confounding explains OOD challenges and proposes methods to mitigate these effects through confounder proxies and environment-specific modeling.
Findings
ERM can outperform OOD methods under certain shifts
Using all covariates, not just causal ones, can improve OOD performance
Models with confounder proxies help mitigate confounding shifts
Abstract
Distribution shifts introduce uncertainty that undermines the robustness and generalization capabilities of machine learning models. While conventional wisdom suggests that learning causal-invariant representations enhances robustness to such shifts, recent empirical studies present a counterintuitive finding: (i) empirical risk minimization (ERM) can rival or even outperform state-of-the-art out-of-distribution (OOD) generalization methods, and (ii) its OOD generalization performance improves when all available covariates, not just causal ones, are utilized. Drawing on both empirical and theoretical evidence, we attribute this phenomenon to hidden confounding. Shifts in hidden confounding induce changes in data distributions that violate assumptions commonly made by existing OOD generalization approaches. Under such conditions, we prove that effective generalization requires learning…
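The effect described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: the linear structural equations, the coefficients, the variable names (`u`, `x_c`, `x_i`), and the size of the shift are all assumptions chosen for illustration. A hidden confounder `u` drives both a causal covariate `x_c` and the label `y`, a non-causal covariate `x_i` acts as a noisy proxy of `u`, and the mean of `u` shifts between the training and test environments. Ordinary least squares stands in for ERM.

```python
import numpy as np


def sample_env(n, u_mean, rng):
    """Draw one environment; only the mean of the hidden confounder u changes."""
    u = rng.normal(u_mean, 1.0, n)                 # hidden confounder (never given to the model)
    x_c = 0.5 * u + rng.normal(0.0, 1.0, n)        # causal covariate, influenced by u
    x_i = u + rng.normal(0.0, 0.5, n)              # non-causal covariate: noisy proxy of u
    y = x_c + 2.0 * u + rng.normal(0.0, 0.5, n)    # label depends on x_c and on u
    return x_c, x_i, y


def fit_ols(features, y):
    """Least-squares fit with an intercept (a stand-in for ERM with squared loss)."""
    design = np.column_stack([np.ones(len(y))] + list(features))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def mse(coef, features, y):
    """Mean squared error of a fitted linear model on the given data."""
    design = np.column_stack([np.ones(len(y))] + list(features))
    return float(np.mean((design @ coef - y) ** 2))


def run_demo(seed=0, n=20000):
    rng = np.random.default_rng(seed)
    xc_tr, xi_tr, y_tr = sample_env(n, 0.0, rng)   # training environment
    xc_te, xi_te, y_te = sample_env(n, 2.0, rng)   # test environment with shifted confounder

    coef_causal = fit_ols([xc_tr], y_tr)           # causal covariate only
    coef_all = fit_ols([xc_tr, xi_tr], y_tr)       # causal covariate plus proxy

    return {
        "causal_only": mse(coef_causal, [xc_te], y_te),
        "all_covariates": mse(coef_all, [xc_te, xi_te], y_te),
    }


if __name__ == "__main__":
    print(run_demo())
```

In this toy setting the model that also sees the proxy has a much lower test error, because the proxy carries part of the confounder shift that the causal-only model cannot observe. This says nothing about how large the gap is on real data.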
Peer Reviews
Decision·ICLR 2026 Poster
This paper provides a neat information-theoretic framework to understand the OOD performance of predictors under hidden confounding shift. The theory explains two empirical phenomena: 1) Learning invariant representations is not optimal under hidden confounder shift. 2) Non-causal features can improve performance. The theory does indeed explain the empirical OOD performance of different training algorithms and the different sets of features (causal, arguably causal, all) used. However, in practice
Regarding invariant representations: It is well known that under hidden confounder shift, they underperform compared to methods that use information about the environment or its proxies. This makes intuitive sense, since methods with information about the environment should have an advantage over methods without it. I believe the primary strength of invariant methods is in cases where the test environment lies outside the support of the training environments. Therefore, I do not see explaining t
The work theoretically justifies the importance of non-causal, informative covariates (X_I) by showing they help maximize generalization performance. Proposition 4.1 demonstrates that adding these variables reduces concept shift and increases conditional informativeness, thereby mitigating the negative impact of unobserved confounders. Experiments using synthetic data with known causal structure and extensive testing on eight real-world tabular datasets (TableShift benchmark) consistently con
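The intuition that extra covariates cannot reduce informativeness follows from a standard identity, the chain rule of mutual information. The block below is not a restatement of Proposition 4.1, which is not reproduced on this page; the symbol X_C for the causal covariates is an assumed notation, paired with the X_I used in the review.

\[
I(Y; X_C, X_I) = I(Y; X_C) + I(Y; X_I \mid X_C) \;\ge\; I(Y; X_C),
\qquad
H(Y \mid X_C, X_I) \le H(Y \mid X_C).
\]

Equality holds exactly when X_I carries no additional information about Y once X_C is known, that is, when I(Y; X_I | X_C) = 0.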
The core theoretical insights (Theorems 4.2 and related decompositions) are dependent on assuming a very specific, unobserved causal graph. This foundational assumption remains unverifiable in real-world data. This limits the ability of the derived guidance to be applied with certainty, as the true underlying causal structure is unknown. The empirical validation is focused almost exclusively on tabular prediction tasks employing relatively simple models like XGBoost and MLP. The role of the fea
1. Understanding out-of-domain generalization is a very important problem, and the paper targets a key issue in this area: the performance gap between robustness-oriented methods and plain ERM. The results in the paper should benefit the community in both understanding and subsequent method development.
2. The paper is well-written and well-structured, with rich discussion and inspiring insights.
3. The theory is supported by solid experiments and empirical insights.
1. Maybe I missed something, but it would be helpful to clarify which datasets the results in Section 5 come from (synthetic data or the TableShift benchmark?). It seems the results are from a single dataset. Or did you merge all the data?
2. While the authors clarify that this paper aims to provide empirical insights instead of solutions, which is totally fair, it might be useful to suggest some solutions based on the observed results.
