Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture
Nihal Mehta

TL;DR
This paper gives a mathematical interpretation of self-attention in Transformers. It links the mechanism to distributional semantics and shows that it arises from projecting corpus-level co-occurrence statistics into the sequence context, which accounts for the architecture's design choices.
Contribution
It introduces a unified projection-based framework for understanding self-attention, connecting it to distributional semantics and deriving Transformer components from this principle.
Findings
Self-attention can be derived from projecting co-occurrence matrices.
Positional encodings and multi-head attention are structured refinements of the projection principle.
The Transformer architecture's algebraic form follows from the distributional projection framework.
Abstract
This paper presents a mathematical interpretation of self-attention by connecting it to distributional semantics principles. We show that self-attention emerges from projecting corpus-level co-occurrence statistics into sequence context. Starting from the co-occurrence matrix underlying GloVe embeddings, we demonstrate how the projection naturally captures contextual influence, with the query-key-value mechanism arising as the natural asymmetric extension for modeling directional relationships. Positional encodings and multi-head attention then follow as structured refinements of this same projection principle. Our analysis demonstrates that the Transformer architecture's particular algebraic form follows from these projection principles rather than being an arbitrary design choice.
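The projection idea in the abstract can be illustrated with a small NumPy sketch. This is one reading of the abstract, not the paper's own derivation. It assumes that inner products of GloVe-style embeddings stand in for corpus-level (log) co-occurrence statistics. The vocabulary size, dimensions, token ids, and random matrices are hypothetical toy values, and the `1/sqrt(d_k)` scaling is the standard Transformer convention rather than something stated in the abstract.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
V, d = 6, 4                      # toy vocabulary size and embedding width (hypothetical)
E = rng.normal(size=(V, d))      # stand-in for GloVe-style word vectors

# Assumption: inner products of embeddings approximate corpus-level
# (log) co-occurrence statistics, so M plays the role of that matrix.
M = E @ E.T

seq = np.array([2, 0, 5, 3])     # token ids of one toy sequence
X = E[seq]                       # embeddings of the tokens in context

# Symmetric case: the scores are exactly the corpus-level statistics
# restricted (projected) to the tokens present in this sequence.
sym_scores = X @ X.T
sym_attn = softmax(sym_scores)   # each row is a distribution over the sequence
sym_context = sym_attn @ X       # context-weighted mixture of embeddings

# Asymmetric case: separate query/key/value maps (random here, learned in
# practice) let token i's influence on j differ from j's influence on i.
d_k = 3
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
qk_scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
attn = softmax(qk_scores)
output = attn @ (X @ W_V)
```

In the symmetric case, `sym_scores` equals the sub-block `M[np.ix_(seq, seq)]` of the corpus-level matrix, which is the sense in which the sequence-level scores are a projection of global statistics. Introducing `W_Q` and `W_K` breaks that symmetry, matching the abstract's description of the query-key-value mechanism as an asymmetric extension for directional relationships.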
Taxonomy
Topics: Embodied and Extended Cognition · Philosophy and Theoretical Science · Ferroelectric and Negative Capacitance Devices
