Transformers Handle Endogeneity in In-Context Linear Regression
Haodong Liang, Krishnakumar Balasubramanian, Lifeng Lai

TL;DR
This paper shows that transformers can handle endogeneity in in-context linear regression by emulating instrumental variable (IV) methods, specifically two-stage least squares (2SLS), and that trained transformers give more robust and reliable predictions and coefficient estimates than 2SLS.
Contribution
It introduces a transformer-based method that emulates 2SLS for endogeneity correction and offers theoretical guarantees for its effectiveness.
Findings
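Transformers can emulate a gradient-based bi-level optimization procedure that converges to the 2SLS solution at an exponential rate.
The global minimizer of the proposed in-context pretraining loss achieves a small excess loss.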
Trained transformers outperform 2SLS in robustness and reliability.
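To make the first finding concrete, here is a minimal NumPy sketch of a gradient iteration whose fixed point is the 2SLS estimate. It is not the paper's transformer construction or its exact update rule: the simultaneous first-stage and second-stage gradient steps, the step sizes `eta` and `alpha`, the simulated data, and the function name `gradient_two_stage` are all assumptions chosen for illustration.

```python
import numpy as np

# Simulated IV data (illustrative assumption, not from the paper):
# 3 instruments z, 2 endogenous regressors x, confounder u enters both x and y.
rng = np.random.default_rng(0)
n = 2000
theta_true = np.array([[1.0, 0.5], [0.0, 1.0], [0.5, -1.0]])
beta_true = np.array([1.0, -2.0])
z = rng.normal(size=(n, 3))
u = rng.normal(size=n)
x = z @ theta_true + u[:, None] + 0.1 * rng.normal(size=(n, 2))
y = x @ beta_true + 2.0 * u + 0.1 * rng.normal(size=n)

# Closed-form 2SLS reference: regress x on z, then y on the fitted x.
theta_hat = np.linalg.lstsq(z, x, rcond=None)[0]
beta_2sls = np.linalg.lstsq(z @ theta_hat, y, rcond=None)[0]


def gradient_two_stage(z, x, y, steps=200, eta=0.5, alpha=0.3):
    """Run joint gradient steps on both stages; return the beta iterates.

    theta follows gradient descent on the first-stage loss ||x - z theta||^2,
    beta follows gradient descent on ||y - (z theta) beta||^2 using the
    current theta. Step sizes are hand-picked for this simulated data.
    """
    n_obs = z.shape[0]
    theta = np.zeros((z.shape[1], x.shape[1]))
    beta = np.zeros(x.shape[1])
    path = []
    for _ in range(steps):
        x_fit = z @ theta  # current first-stage fitted regressors
        grad_beta = x_fit.T @ (x_fit @ beta - y) / n_obs
        grad_theta = z.T @ (z @ theta - x) / n_obs
        beta = beta - alpha * grad_beta
        theta = theta - eta * grad_theta
        path.append(beta.copy())
    return np.array(path)


path = gradient_two_stage(z, x, y)
# Distance of each iterate from the closed-form 2SLS estimate.
errors = np.linalg.norm(path - beta_2sls, axis=1)
```

Plotting `np.log(errors)` against the iteration index should give a roughly straight, downward-sloping line until floating-point precision is reached, which is what an exponential (linear-rate) convergence claim looks like numerically.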
Abstract
We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares (2SLS) solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the 2SLS method, in the presence of endogeneity.
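For readers unfamiliar with the setting, the following is a small NumPy sketch of what endogeneity does to ordinary least squares and how 2SLS corrects it with instruments. The data-generating process, the coefficient values, and the function names `simulate_iv_data`, `ols`, and `two_stage_least_squares` are illustrative assumptions and are not taken from the paper's experiments.

```python
import numpy as np


def simulate_iv_data(n=5000, seed=0):
    """Simulate y = x beta + noise where x is correlated with the noise.

    An unobserved confounder u drives both x and y, so x is endogenous.
    The instruments z affect y only through x.
    """
    rng = np.random.default_rng(seed)
    beta_true = np.array([1.0, -2.0])
    theta = np.array([[1.0, 0.5], [0.0, 1.0], [0.5, -1.0]])  # z -> x map
    z = rng.normal(size=(n, 3))
    u = rng.normal(size=n)
    x = z @ theta + u[:, None] + 0.1 * rng.normal(size=(n, 2))
    y = x @ beta_true + 2.0 * u + 0.1 * rng.normal(size=n)
    return z, x, y, beta_true


def ols(x, y):
    """Ordinary least squares of y on x (biased when x is endogenous)."""
    return np.linalg.lstsq(x, y, rcond=None)[0]


def two_stage_least_squares(z, x, y):
    """2SLS: project x onto the instruments, then regress y on the projection."""
    theta_hat = np.linalg.lstsq(z, x, rcond=None)[0]  # stage 1
    x_hat = z @ theta_hat
    return np.linalg.lstsq(x_hat, y, rcond=None)[0]   # stage 2


z, x, y, beta_true = simulate_iv_data()
ols_error = np.linalg.norm(ols(x, y) - beta_true)
tsls_error = np.linalg.norm(two_stage_least_squares(z, x, y) - beta_true)
print(f"OLS error:  {ols_error:.3f}")
print(f"2SLS error: {tsls_error:.3f}")
```

In this simulated setup the OLS error stays large no matter how many samples are drawn, because the confounder biases the estimate, while the 2SLS error shrinks as `n` grows. This 2SLS estimate is the baseline the abstract compares the trained transformer against.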
Peer Reviews
Decision: ICLR 2025 Poster
I want to caveat this review by saying that I have flagged to the AC that this is not my area of expertise. The paper seems interesting: it extends the theoretical analysis done in previous work that looks at the class of functions that transformers can learn (e.g. simple linear regression) to a more complex set -- those with endogeneity and corresponding instrument variables. The authors show that transformers can learn this function and do as well as the direct solvers in most cases and poten
I want to caveat this review by saying that I have flagged to the AC that this is not my area of expertise. I do not know enough about this area to understand the potential flaws in their statements / etc or more subtle points. At a broad level, their intro / overview makes sense and seems convincing.
I am not familiar with IV regression and endogeneity, so I haven't reviewed the mathematical accuracy of the theorem. My comments are based solely on a basic understanding of the motivation, overall contribution, and presentation. Please consider them with low weights. Overall, I find the paper well-motivated, with clear writing. The background information and literature review appear thorough. Understanding the mechanism of the Transformer could be valuable for advancing future research in th
I understand that theoretical analysis requires specific assumptions. I am, however, curious whether it might be possible to extend the theoretical analysis to the non-linear case, as most real-world scenarios tend to be non-linear. Additionally, could you provide an example of real-world applications where the proposed analysis could be beneficial? If such examples exist, is it feasible to validate the analysis experimentally?
(1) This paper creatively combines transformer architectures with econometric techniques, specifically instrumental variables, to address endogeneity—a novel approach that extends transformers' applicability beyond traditional machine learning domains. (2) The authors provide rigorous theoretical backing, including a bi-level optimization framework, and offer non-asymptotic error bounds, supporting their claims with comprehensive experiments that validate the model's performance against standar
(1) The paper provides strong theoretical foundations but lacks practical guidance for implementation. More details on the parameter settings, model configurations, and optimization process would enhance reproducibility and help readers better understand how to apply the proposed method. (2) While the theoretical contributions are thorough, the presentation is complex and could benefit from simplification or visual aids. This would make the bi-level optimization framework and convergence proper
Taxonomy
Topics: Face and Expression Recognition · Neural Networks and Applications
