Deconstructing Positional Information: From Attention Logits to Training Biases
Zihan Gu, Ruoyu Chen, Han Zhang, Hua Zhang, Yue Hu

TL;DR
This paper analyzes how different types of positional encodings in Transformers influence their ability to integrate positional and semantic information, revealing inherent biases and advantages of multiplicative encodings through synthetic tasks and theoretical insights.
Contribution
It provides a structured analysis of additive and multiplicative positional encodings, introduces a synthetic task to compare their effectiveness, and uncovers inherent training biases in shallow layers.
Findings
Multiplicative encodings outperform additive ones on the synthetic task.
A hidden training bias, the single-head deposit pattern, exists in shallow layers.
This bias is inherent to multiplicative encodings and affects training dynamics.
Abstract
Positional encodings enable Transformers to incorporate sequential information, yet their theoretical understanding remains limited to two properties: distance attenuation and translation invariance. Because natural language lacks purely positional data, the interplay between positional and semantic information is still underexplored. We address this gap by deconstructing the attention-logit computation and providing a structured analysis of positional encodings, categorizing them into additive and multiplicative forms. The differing properties of these forms lead to distinct mechanisms for capturing positional information. To probe this difference, we design a synthetic task that explicitly requires strong integration of positional and semantic cues. As predicted, multiplicative encodings achieve a clear performance advantage on this task. Moreover, our evaluation reveals a hidden…
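To make the additive/multiplicative distinction concrete, here is a minimal NumPy sketch of the two ways positional information can enter an attention logit. It is an illustration under stated assumptions, not the paper's code: an ALiBi-style linear distance bias stands in for the additive form, a RoPE-style rotation stands in for the multiplicative form, and the function names, `slope`, and `base` are hypothetical choices. Causal masking, scaling, and multiple heads are omitted.

```python
import numpy as np

def is_toeplitz(m):
    """True if m[i, j] depends only on i - j (constant along each diagonal)."""
    return np.allclose(m[1:, 1:], m[:-1, :-1])

def additive_logits(q, k, slope=0.5):
    """Content logits plus a position-only bias (ALiBi-style stand-in, assumed here)."""
    n = q.shape[0]
    pos = np.arange(n)
    # The bias depends only on |i - j|, so it is a Toeplitz matrix,
    # and it is added without ever touching the token content.
    bias = -slope * np.abs(pos[:, None] - pos[None, :])
    return q @ k.T + bias

def rope_rotate(x, base=10000.0):
    """Rotate consecutive feature pairs of x (shape [n, d], d even) by position-dependent angles."""
    n, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)         # one frequency per feature pair
    angles = np.arange(n)[:, None] * freqs[None, :]   # shape [n, d/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def multiplicative_logits(q, k):
    """Position enters by rotating q and k, so it multiplies the content (RoPE-style stand-in)."""
    return rope_rotate(q) @ rope_rotate(k).T
```

With identical content at every position, both logit matrices are Toeplitz, which reflects translation invariance. The difference shows up when the content changes: the additive positional term `additive_logits(q, k) - q @ k.T` is the same for any `q` and `k`, whereas the corresponding multiplicative term varies with the content being rotated.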
Peer Reviews
Decision·ICLR 2026 Poster
- Relevant topic with a clear gap in the literature, whereby the differences across different strategies for positional encoding are poorly understood, making application to new tasks ad hoc and reliant upon hyperparameter tuning rather than theoretical insight. This paper aims to provide theoretical insight into a very clear and directly stated question: "how, precisely, do different PE schemes mediate the interaction between token content and position?" The answers are interesting in their ow
- It wasn't clear to me why the authors posit that "intense structurally-induced specialization is a primary cause of the gap between RoPE's theoretical promise and its practical performance." This theme comes up many times (e.g. line 387 "the deposit pattern, while effective, is an inefficient use of model capacity"), and I never felt the authors fully explained their rationale. Modularity is often praised for its benefits, in contrast to this position. - Some aspects of prior literature wer
1. PE is used in almost all LLMs (except NoPE variants), so this is an important topic to analyze for downstream studies. 2. The framework proposed in this study can be used to investigate those widely used PE methods. Thus the findings in this paper are applicable to a wide range of LLM studies. 3. The tasks used to analyze PE are simple, and the simplicity is also very important for XAI. The proposed position-dependent and position-agnostic tasks can directly isolate the effects of eac
Overall, this study seems novel and interesting. However, the presentation of the current draft could be improved. This study aims to organize multiple things: absolute PE, relative PE, RoPE, ALiBi, fixed PE, learnable PE, additive PE, and multiplicative PE. The current presentation does not seem clear enough to organize all of those, especially in Fig. 1. It took me several “hops” to combine those pieces and to work out what Toeplitz matrices are used for, but I could be wrong or have missed something. Another weakness
1. The paper provides clear insights by formalizing the mechanistic distinction between additive and multiplicative approaches to incorporating and modulating interactions between positional and content information in model representations, particularly in relation to relative positional encoding. Its controlled experimental settings effectively demonstrate the various idiosyncrasies of RoPE as consequences of its multiplicative interaction mechanism, most notably the single-head deposit phenome
1. The second synthetic task the authors design (trigger word counting) seems rather off the point. It essentially tests to what extent different position encoding methods can suppress themselves and perform the “no-op” behavior, which is not related to the authors' characterization of the particular attention matrix component to which the relative positional information is injected via the Toeplitz formulation. From my perspective, the second task should be content-agnostic instead of position-
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Multimodal Machine Learning Applications · Face Recognition and Perception
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Softmax · Position-Wise Feed-Forward Layer
