Learning interpretable positional encodings in transformers depends on initialization

Takuya Ito; Luca Cocchi; Tim Klinger; Parikshit Ram; Murray Campbell; Luke Hearne

arXiv:2406.08272·cs.LG·June 24, 2025·1 cites

Open Access

TL;DR

This paper shows that the initialization of learnable positional encodings in transformers critically affects their ability to learn interpretable and effective position representations, especially in complex, multi-dimensional datasets.

Contribution

It demonstrates that initializing learnable PEs from small-norm distributions lets them learn interpretable position encodings that generalize across diverse tasks.
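
As a rough illustration of what small-norm initialization of a learnable PE could look like, the sketch below draws each PE entry from a zero-mean Gaussian and compares a small standard deviation against a large one. The function names, the Gaussian choice, the `sigma` values, and the additive combination with token embeddings (as in the standard transformer) are all assumptions made for this example; the page does not state the paper's exact distributions or scales.

```python
import math
import random


def init_learnable_pe(num_positions, dim, sigma, seed=0):
    """Hypothetical helper: draw every PE entry i.i.d. from N(0, sigma^2).

    Returns a list of `num_positions` vectors, each of length `dim`.
    In a real model these values would be trainable parameters.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(dim)]
            for _ in range(num_positions)]


def mean_norm(pe):
    """Average Euclidean norm of the position vectors."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in pe) / len(pe)


def add_pe(token_embeddings, pe):
    """Add each position vector to the token embedding at that position."""
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(token_embeddings, pe)]


# Illustrative scales only (assumed, not taken from the paper).
small_pe = init_learnable_pe(num_positions=16, dim=32, sigma=0.01)
large_pe = init_learnable_pe(num_positions=16, dim=32, sigma=1.0)

print("mean norm, small init:", mean_norm(small_pe))
print("mean norm, large init:", mean_norm(large_pe))

# With a small-norm init, the model input starts out close to the raw tokens.
tokens = [[1.0] * 32 for _ in range(16)]
inputs = add_pe(tokens, small_pe)
```

In a deep learning framework the same idea would amount to choosing a small standard deviation when initializing the trainable PE parameter; the comparison above only shows how the initial norm scales with that choice.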

Findings

01. Learned PEs initialized from small-norm distributions uncover interpretable positions.

02. Proper initialization improves generalization in complex datasets.

03. Empirical validation across 2D, stochastic, and 3D neuroscience tasks.

Abstract

In transformers, the positional encoding (PE) provides essential information that distinguishes the position and order amongst tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those presented in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known. Here we find that the choice of initialization of a learnable PE greatly influences its ability to learn interpretable PEs that lead to enhanced generalization. We empirically demonstrate our findings in three experiments: 1) A 2D relational reasoning task; 2) A nonlinear stochastic network simulation; 3) A real world 3D…

Taxonomy

Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Logic, Reasoning, and Knowledge

Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer