Transformer Is Inherently a Causal Learner
Xinyue Wang, Stephen Wang, Biwei Huang

TL;DR
This paper demonstrates that autoregressive transformers inherently encode causal structures in their learned representations, enabling effective causal discovery from time series data without explicit causal objectives.
Contribution
It establishes a theoretical and practical link between transformers and causal discovery, introducing a gradient attribution method that outperforms existing algorithms in complex scenarios.
Findings
Gradient sensitivities recover causal graphs accurately.
Method outperforms state-of-the-art algorithms in nonlinear and non-stationary systems.
Causal discovery accuracy improves with more data and heterogeneity.
Abstract
We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property…
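The extraction idea in the abstract (score each past input by how strongly the forecast responds to it, aggregate over samples, then threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: a linear least-squares forecaster stands in for the trained transformer, central finite differences stand in for backpropagated gradients, only lag 1 is considered, and the function names, the toy system `A_true`, and the 0.25 threshold are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_var(A, n_steps, rng):
    """Simulate x_t = A @ x_{t-1} + noise; A[i, j] != 0 means x_j -> x_i at lag 1."""
    d = A.shape[0]
    x = np.zeros((n_steps, d))
    for t in range(1, n_steps):
        x[t] = A @ x[t - 1] + rng.normal(size=d)
    return x

def fit_forecaster(past, future):
    """Least-squares one-step forecaster (assumed stand-in for a transformer)."""
    W, *_ = np.linalg.lstsq(past, future, rcond=None)
    return lambda inputs: inputs @ W

def gradient_scores(predict, past, eps=1e-4):
    """Mean |d output_i / d input_j| over all samples, via central differences."""
    _, d = past.shape
    scores = np.zeros((d, d))
    for j in range(d):
        bump = np.zeros(d)
        bump[j] = eps  # perturb only input variable j
        grads = (predict(past + bump) - predict(past - bump)) / (2 * eps)
        scores[:, j] = np.abs(grads).mean(axis=0)  # aggregate over samples
    return scores

rng = np.random.default_rng(0)
# Hypothetical ground truth: x0 -> x1 -> x2, plus self-dependence on each variable.
A_true = np.array([[0.5, 0.0, 0.0],
                   [0.8, 0.5, 0.0],
                   [0.0, -0.7, 0.5]])
series = simulate_var(A_true, n_steps=5000, rng=rng)
past, future = series[:-1], series[1:]

predict = fit_forecaster(past, future)
scores = gradient_scores(predict, past)
estimated_graph = scores > 0.25  # illustrative threshold, not from the paper
true_graph = A_true != 0
print(np.round(scores, 2))
print("graph recovered:", bool((estimated_graph == true_graph).all()))
```

For a linear forecaster the sensitivities reduce to the fitted coefficients, so this toy case is easy by construction. With a nonlinear model the same aggregation would average input-dependent gradients over the dataset, which is where the choice of aggregation and threshold starts to matter.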
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper is well motivated, connecting two areas - causal discovery and transformer-based autoregressive modeling. The authors clearly argue that transformers, trained with standard forecasting objectives, may implicitly learn lagged causal structures.
2. The experiments cover a wide range of synthetic regimes, including high-dimensional, long-range, nonlinear, and non-stationary systems.
3. The transformer-based approach scales efficiently to higher dimensions and larger lag windows, wher
1. While the paper establishes a theoretical link between population gradients and causal identifiability, it does not provide empirical evidence that gradient- or LRP-based attributions reliably correspond to true causal effects. All experiments evaluate graph recovery accuracy but do not include diagnostic analyses verifying whether gradient magnitudes align with known interventional or conditional-independence causal measures. As a result, it remains unclear whether the model’s attributions ca
- The main claim—that a regular forecasting transformer effectively learns a causal graph without any explicit causal objective—is both elegant and conceptually appealing. It links large-scale predictive modeling with causal structure learning in a way that feels natural and potentially impactful.
- The paper doesn’t rely solely on empirical results. Theorem 1 provides a clear connection between the forecasting task and recoverability of causal parents, assuming standard conditions.
- The eval
- The method seems to require more data compared to some specialized approaches in low-dimensional or simpler linear systems. For example, VAR-LiNGAM requires far less data when the dynamics are mostly linear (Figure 3).
- (minor) Figure 2’s caption labels two subplots as (C), which needs correction.
Missing Citations / Context
The paper compares well against classical baselines, but recent transformer-based causal discovery approaches are not discussed. For example, CausalFormer (Liu et al
- It's certainly of interest to the causality community to obtain scalable and efficient structure learning algorithms, and equally interesting to the more general ML community to understand causal properties of commonly-used architectures.
- The results are impressive, with transformers outperforming several existing causal discovery methods in various settings.
- The paper is well-written and well-motivated, which is particularly positive given the broad audience targeted.
The primary weakness in my opinion is the framing of the paper. To me, the paper essentially makes two separate arguments: (1) that under assumptions A1-A4, causal structure learning reduces to standard statistical learning; and (2), that transformers are particularly well-aligned to causal learning. Both of these arguments appear sound, but I would consider (1) to be well-known, and (2) to fall short of the claim that transformers are "inherently causal." Indeed, the authors concede that Theore
Taxonomy
Topics: Time Series Analysis and Forecasting · Advanced Graph Neural Networks · Machine Learning in Healthcare
