Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding
Duy-Tung Pham, An The Nguyen, Viet-Hoang Tran, Nhan-Phu Chung, Xin T. Tong, Tan M. Nguyen, Thieu N. Vo

TL;DR
This paper analyzes the dynamical behavior of tokens in pre-trained Transformers, revealing how positional encoding influences convergence and divergence, and proposes architectural refinements to enhance model performance.
Contribution
It provides a theoretical analysis of token dynamics in Transformers, characterizes conditions for convergence/divergence, and introduces refinements to mitigate adverse effects of positional encoding.
Findings
Empirically, the convergence scenario (tokens converging to zero) adversely affects model performance.
Absolute and rotary positional encodings affect these dynamical regimes in different ways.
Proposed refinements improve Transformer robustness.
Abstract
This paper investigates the dynamical properties of tokens in pre-trained Transformer models and explores their application to improving Transformers. To this end, we analyze the dynamical system governing the continuous-time limit of the pre-trained model and characterize the asymptotic behavior of its solutions. Specifically, we characterize when tokens move closer to or farther from one another over time, depending on the model parameters. We provide sufficient conditions, based on these parameters, to identify scenarios where tokens either converge to zero or diverge to infinity. Unlike prior works, our conditions are broader in scope and more applicable to real-world models. Furthermore, we investigate how different forms of positional encoding -- specifically absolute and rotary -- affect these dynamical regimes. Empirical evidence reveals that the convergence scenario adversely…
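The abstract describes the continuous-time dynamical system only in words. As a rough illustration, the sketch below integrates one commonly used continuous-time form of single-head self-attention, dx_i/dt = sum_j softmax_j(<Q x_i, K x_j>) V x_j, with forward Euler steps and tracks the largest token norm over time. This specific equation, the Euler discretization, and the names `attention_velocity` and `simulate` are assumptions made for illustration; the excerpt does not state the exact system the authors analyze, and no positional encoding is modeled here.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_velocity(X, Q, K, V):
    """Right-hand side of the assumed token dynamics.

    X holds one token per row, shape (n, d). Q, K, V are (d, d) matrices.
    Row i of the result is sum_j A[i, j] * (V @ x_j), where
    A[i, :] = softmax over j of <Q x_i, K x_j>.
    """
    scores = (X @ Q.T) @ (X @ K.T).T   # scores[i, j] = <Q x_i, K x_j>
    A = softmax(scores, axis=1)        # each row sums to 1
    return A @ (X @ V.T)


def simulate(X0, Q, K, V, dt=0.01, steps=500):
    """Forward-Euler integration; returns final tokens and max-norm history."""
    X = np.array(X0, dtype=float)
    norms = [np.linalg.norm(X, axis=1).max()]
    for _ in range(steps):
        X = X + dt * attention_velocity(X, Q, K, V)
        norms.append(np.linalg.norm(X, axis=1).max())
    return X, np.array(norms)
```

With a single token the attention weight is always 1, so the system reduces to dx/dt = V x: choosing V = -I makes the token shrink toward zero, and V = +I makes it grow without bound. For several tokens, plotting the returned norm history for different choices of Q, K, and V gives a quick way to see which regime a given parameter setting falls into under this assumed model.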
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Model Reduction and Neural Networks
