Decoupling Positional and Symbolic Attention Behavior in Transformers
Felipe Urrutia, Jorge Salas, Alexander Kozachinskiy, Cristian Buc Calderon, Hector Pasten, Cristobal Rojas

TL;DR
This paper investigates how Transformers encode positional and symbolic information, providing a theoretical framework and empirical analysis of attention head behaviors, and demonstrating control over model performance through frequency manipulation.
Contribution
The paper introduces a formal distinction between positional and symbolic attention behaviors, develops a metric to quantify them, and shows how controlling access to frequencies influences Transformer performance.
Findings
All attention heads show a strong link between behavior and frequency use.
Transformer performance can be controlled by restricting frequency access.
Positional and symbolic behaviors are proven to be mutually exclusive.
Abstract
An important aspect subtending language understanding and production is the ability to independently encode positional and symbolic information of the words within a sentence. In Transformers, positional information is typically encoded using Positional Encodings (PEs). One such popular PE, namely Rotary PE (RoPE), has been widely used due to its empirical success. Recently, it has been argued that part of RoPE's success emerges from its ability to encode robust positional and semantic information using large and small frequencies, respectively. In this work, we perform a deeper dive into the positional versus symbolic dichotomy of attention-head behavior, at both the theoretical and empirical levels. We provide general definitions of what it means for a head to behave positionally or symbolically, prove that these are two mutually exclusive behaviors and develop a metric to quantify…
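To make the role of frequencies concrete, here is a minimal pure-Python sketch of RoPE applied to a single query/key pair. It uses the standard RoPE frequency schedule (base 10000). The `keep` mask, and the choice to zero out masked pairs, are hypothetical illustrations of "restricting frequency access". The abstract does not specify the paper's actual procedure, so this should not be read as the authors' method.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE schedule: theta_i = base^(-2i/d), one frequency per 2-D pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, freqs, keep=None):
    """Rotate each 2-D pair of `vec` by the angle position * theta_i.

    `keep` is a hypothetical boolean mask (an assumption for illustration):
    pairs with keep[i] == False are zeroed, so that frequency cannot
    contribute to the attention logit.
    """
    out = []
    for i, theta in enumerate(freqs):
        x, y = vec[2 * i], vec[2 * i + 1]
        if keep is not None and not keep[i]:
            out.extend([0.0, 0.0])
            continue
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def attention_logit(q, k, m, n, freqs, keep=None):
    """Unnormalized attention score between a query at position m and a key at position n."""
    q_rot = apply_rope(q, m, freqs, keep)
    k_rot = apply_rope(k, n, freqs, keep)
    return sum(a * b for a, b in zip(q_rot, k_rot))

# Example usage with arbitrary toy vectors (head_dim = 8, so 4 frequencies).
freqs = rope_frequencies(8)
q = [1.0, 0.5, -0.3, 0.8, 0.2, -0.1, 0.7, 0.4]
k = [0.3, -0.2, 0.9, 0.1, -0.5, 0.6, 0.2, 0.8]

# Full frequency access: the score depends on the offset m - n.
full_near = attention_logit(q, k, 0, 0, freqs)
full_far = attention_logit(q, k, 4, 0, freqs)

# Only the smallest frequency kept: the score barely changes with the offset.
low_only = [False, False, False, True]
low_near = attention_logit(q, k, 0, 0, freqs, low_only)
low_far = attention_logit(q, k, 4, 0, freqs, low_only)
```

With all frequencies available, the logit depends on the relative offset between the two positions. With only the smallest frequency kept, the rotation angle over short offsets is tiny, so the logit is driven almost entirely by the content of `q` and `k`. This matches the intuition in the abstract that large frequencies carry positional information while small frequencies carry semantic information.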
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Action Observation and Synchronization · Topic Modeling
