The Impact of Positional Encodings on Multilingual Compression
Vinit Ravishankar, Anders Søgaard

TL;DR
This paper investigates how different positional encoding methods affect multilingual transformer models. It finds that proposed modifications to sinusoidal positional encodings improve monolingual language models but not multilingual ones, and attributes this to the compositionality that sinusoidal encodings were designed to support.
Contribution
The study demonstrates that sinusoidal positional encodings are more effective for multilingual models, explaining why modifications improve monolingual but not multilingual performance.
Findings
Sinusoidal encodings facilitate compositionality in multilingual models.
Learned positional encodings approximate sinusoidal ones but lack compositionality.
Complex positional encoding architectures are less effective in multilingual settings.
Abstract
In order to preserve word-order information in a non-autoregressive setting, transformer architectures tend to include positional knowledge, by (for instance) adding positional encodings to token embeddings. Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture; these include, for instance, separating position encodings and token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them results in better multilingual language models. We then answer why that is: sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. Higher variances in multilingual training distributions require higher…
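The abstract's remark about "linear projections over arbitrary time steps" can be illustrated with a small sketch. It assumes the standard interleaved sine/cosine formulation from the original transformer; the function names, the toy dimension of 8, and the base of 10000 are illustrative choices, not code from the paper. For any offset k there is a fixed block-diagonal rotation matrix that maps the encoding of position pos to the encoding of pos + k, regardless of pos.

```python
import math

def sinusoidal_encoding(pos, d_model=8, base=10000.0):
    """Interleaved [sin, cos] pairs, one pair per frequency (d_model must be even)."""
    enc = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (base ** (i / d_model))
        enc.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return enc

def shift_matrix(k, d_model=8, base=10000.0):
    """Hypothetical helper: matrix M_k such that M_k @ PE(pos) = PE(pos + k).

    M_k depends only on the offset k, never on pos.
    """
    m = [[0.0] * d_model for _ in range(d_model)]
    for i in range(0, d_model, 2):
        freq = 1.0 / (base ** (i / d_model))
        c, s = math.cos(k * freq), math.sin(k * freq)
        # Angle-addition identities:
        #   sin(a + b) =  sin(a)cos(b) + cos(a)sin(b)
        #   cos(a + b) = -sin(a)sin(b) + cos(a)cos(b)
        m[i][i], m[i][i + 1] = c, s
        m[i + 1][i], m[i + 1][i + 1] = -s, c
    return m

def apply(m, v):
    """Plain matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def max_shift_error(pos, k, d_model=8):
    """Largest absolute gap between M_k @ PE(pos) and PE(pos + k)."""
    shifted = apply(shift_matrix(k, d_model), sinusoidal_encoding(pos, d_model))
    target = sinusoidal_encoding(pos + k, d_model)
    return max(abs(a - b) for a, b in zip(shifted, target))

if __name__ == "__main__":
    print(max_shift_error(pos=3, k=5))
```

Because the same matrix works for every starting position, a model can in principle express "k steps later" as a single linear map. This sketch shows only that property of the encoding itself; it does not reproduce the paper's experiments.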
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
