Transformers are Bayesian Networks
Gregory Coppola

TL;DR
This paper argues that a transformer is a Bayesian network: every sigmoid transformer implements weighted loopy belief propagation on its implicit factor graph, with one layer corresponding to one round, and a transformer can be constructed to run exact belief propagation on a declared knowledge base.
Contribution
The paper formally proves that sigmoid transformers implement weighted loopy belief propagation, gives a constructive proof of exact belief propagation on declared knowledge bases, and proves a uniqueness result for sigmoid transformers that produce exact posteriors.
Findings
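Every sigmoid transformer, whatever its weights, performs weighted loopy belief propagation on its implicit factor graph; one layer is one round of BP.
A transformer can be constructed to implement exact belief propagation on any declared knowledge base; without circular dependencies, the probability estimates are provably correct at every node.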
The architecture's AND/OR structure aligns with Pearl's algorithm.
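To make the exactness finding concrete, here is a minimal Python sketch of ordinary sum-product belief propagation on a toy chain with no cycles, checked against brute-force enumeration. It is not the paper's transformer construction: the three variables A, B, C, the factor values, and the function names are all hypothetical, chosen only to illustrate why BP marginals are exact when there are no circular dependencies. Each pass of the loop is one synchronous round of message passing, the unit the paper identifies with one layer.

```python
import itertools
import numpy as np

# Hypothetical toy model (not from the paper): a chain A - B - C of binary variables.
unary = {
    "A": np.array([0.6, 0.4]),
    "B": np.array([0.5, 0.5]),
    "C": np.array([0.3, 0.7]),
}
# pairwise[(u, v)][i, j] is the factor value for u = i, v = j.
pairwise = {
    ("A", "B"): np.array([[0.9, 0.1], [0.2, 0.8]]),
    ("B", "C"): np.array([[0.7, 0.3], [0.4, 0.6]]),
}


def neighbors(node):
    """Nodes that share a pairwise factor with `node`."""
    return [v if u == node else u for (u, v) in pairwise if node in (u, v)]


def factor(src, dst):
    """Pairwise table indexed as [src_state, dst_state]."""
    if (src, dst) in pairwise:
        return pairwise[(src, dst)]
    return pairwise[(dst, src)].T


def bp_marginals(rounds):
    """Run `rounds` synchronous rounds of sum-product BP and return node beliefs."""
    # messages[(src, dst)] is a vector over dst's states, initialised uniform.
    messages = {}
    for (u, v) in pairwise:
        messages[(u, v)] = np.ones(2)
        messages[(v, u)] = np.ones(2)
    for _ in range(rounds):
        new = {}
        for (src, dst) in messages:
            # Combine src's local evidence with messages from all neighbors except dst.
            incoming = unary[src].copy()
            for n in neighbors(src):
                if n != dst:
                    incoming = incoming * messages[(n, src)]
            # Sum out src's state to get a message over dst's states.
            msg = factor(src, dst).T @ incoming
            new[(src, dst)] = msg / msg.sum()
        messages = new
    beliefs = {}
    for node in unary:
        b = unary[node].copy()
        for n in neighbors(node):
            b = b * messages[(n, node)]
        beliefs[node] = b / b.sum()
    return beliefs


def brute_force_marginals():
    """Exact marginals by enumerating all 2^3 joint states."""
    names = list(unary)
    joint = np.zeros((2, 2, 2))
    for states in itertools.product(range(2), repeat=3):
        x = dict(zip(names, states))
        p = 1.0
        for n in names:
            p *= unary[n][x[n]]
        for (u, v), table in pairwise.items():
            p *= table[x[u], x[v]]
        joint[states] = p
    joint /= joint.sum()
    return {
        n: joint.sum(axis=tuple(j for j in range(3) if j != i))
        for i, n in enumerate(names)
    }


if __name__ == "__main__":
    bp = bp_marginals(rounds=3)
    exact = brute_force_marginals()
    for name in unary:
        print(name, bp[name], exact[name], np.allclose(bp[name], exact[name]))
```

On this chain the messages stop changing after two rounds, because the longest path has two edges; on a graph with cycles the same loop would be loopy BP and would give approximations in general.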
Abstract
Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network. We establish this in five ways. First, we prove that every sigmoid transformer with any weights implements weighted loopy belief propagation on its implicit factor graph. One layer is one round of BP. This holds for any weights -- trained, random, or constructed. Formally verified against standard mathematical axioms. Second, we give a constructive proof that a transformer can implement exact belief propagation on any declared knowledge base. On knowledge bases without circular dependencies this yields provably correct probability estimates at every node. Formally verified against standard mathematical axioms. Third, we prove uniqueness: a sigmoid transformer that produces exact posteriors necessarily has BP…
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Advanced Graph Neural Networks · Logic, Reasoning, and Knowledge
