Analyzing Transformer Dynamics as Movement through Embedding Space
Sumeet S. Singh

TL;DR
This paper introduces a novel perspective on Transformer models by framing their dynamics as movement through embedding space, revealing how intelligent behaviors emerge from probabilistic paths and vector arrangements.
Contribution
It proposes a theory that models Transformer inference as paths in embedding space, unifies it with other sequence models, and formalizes a concept-space interpretation of embeddings.
Findings
Transformers' behaviors correspond to paths in embedding space.
Training learns a probability distribution over possible paths.
Embedding arrangements influence path probabilities and model behavior.
Abstract
Transformer-based language models exhibit intelligent behaviors such as understanding natural language, recognizing patterns, acquiring knowledge, reasoning, planning, reflecting, and using tools. This paper explores how their underlying mechanics give rise to intelligent behaviors. To that end, we propose framing Transformer dynamics as movement through embedding space. Examining Transformers through this perspective reveals key insights, establishing a Theory of Transformers: 1) Intelligent behaviors map to paths in Embedding Space, which the Transformer random-walks through during inference. 2) LM training learns a probability distribution over all possible paths. 'Intelligence' is learnt by assigning higher probabilities to paths representing intelligent behaviors. No learning can take place in-context; context only narrows the subset of paths sampled during decoding. 5) The…
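As a rough illustration of points 1) and 2), the toy sketch below treats decoding as a random walk over a handful of token vectors. Everything in it is an assumption made for illustration and is not taken from the paper: the four-word vocabulary, the hand-picked 2-D vectors, the use of a plain mean to fold the context into one vector, and the softmax over dot products that turns proximity into next-token probabilities.

```python
import math
import random

# Hypothetical toy vocabulary with hand-picked 2-D embedding vectors.
EMBEDDINGS = {
    "cat": (1.0, 0.2),
    "dog": (0.9, 0.3),
    "sat": (0.1, 1.0),
    "ran": (0.2, 0.9),
}


def context_vector(tokens):
    """Fold a token sequence into one vector (assumption: a plain mean)."""
    vecs = [EMBEDDINGS[t] for t in tokens]
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def next_token_probs(tokens):
    """Softmax over dot products of the context vector with each token vector."""
    ctx = context_vector(tokens)
    scores = {t: sum(c * x for c, x in zip(ctx, v)) for t, v in EMBEDDINGS.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def path_probability(prompt, continuation):
    """Probability of a path: product of the per-step conditional probabilities."""
    tokens = list(prompt)
    prob = 1.0
    for tok in continuation:
        prob *= next_token_probs(tokens)[tok]
        tokens.append(tok)
    return prob


def random_walk(prompt, steps, rng):
    """Sample one path by repeatedly drawing the next token from the distribution."""
    tokens = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(tokens)
        names = list(probs)
        weights = [probs[n] for n in names]
        tokens.append(rng.choices(names, weights=weights, k=1)[0])
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print(random_walk(["cat"], 3, rng))
    print(path_probability(["cat"], ["sat", "ran"]))
```

In this sketch, changing the prompt changes the context vector and therefore which continuations are likely, while the token vectors themselves stay fixed. That loosely mirrors the abstract's claim that context only narrows the subset of paths sampled during decoding.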
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Layer Normalization · Softmax · Dense Connections
