SeqPE: Transformer with Sequential Position Encoding
Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe

TL;DR
SeqPE introduces a fully learnable, unified position encoding method for Transformers that improves extrapolation and generalization across modalities by representing positions as symbolic sequences and employing regularization techniques.
Contribution
The paper proposes SeqPE, a novel position encoding framework that enhances extrapolation and adaptability in Transformers through symbolic sequence representation and end-to-end learning.
Findings
Outperforms baselines in perplexity, exact match (EM), and accuracy.
Improves extrapolation to longer contexts and multi-dimensional inputs.
Enables seamless generalization without architectural redesign.
Abstract
Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods such as ALiBi and RoPE mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each n-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a…
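To make the idea of representing a position index as a symbolic sequence concrete, here is a minimal Python sketch. The abstract does not say how positions are tokenized, so the digit base, the fixed zero-padded width, and the function name `position_to_symbols` are all assumptions made for illustration, not the authors' implementation. The learned sequential position encoder that would consume these symbols is not shown.

```python
from typing import List, Sequence


def position_to_symbols(
    position: Sequence[int], base: int = 10, width: int = 4
) -> List[str]:
    """Flatten an n-dimensional position into a list of digit symbols.

    Hypothetical helper: each coordinate is written as `width` digits in
    the given `base` (most significant digit first, zero-padded), and the
    per-coordinate digit lists are concatenated in order.
    """
    symbols: List[str] = []
    for coord in position:
        if coord < 0 or coord >= base ** width:
            raise ValueError(f"coordinate {coord} does not fit in {width} digits")
        digits: List[str] = []
        for _ in range(width):
            digits.append(str(coord % base))  # least significant digit first
            coord //= base
        symbols.extend(reversed(digits))      # store most significant first
    return symbols


# A 1-D text position and a 2-D image-patch position go through the same
# interface; only the sequence length changes.
text_symbols = position_to_symbols((1023,))
patch_symbols = position_to_symbols((12, 345))
```

Under these assumptions, a position larger than any seen in training is still built from the same small symbol vocabulary. That is one plausible reading of why a sequence-based encoding could extrapolate where a fixed-size lookup table cannot.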
Taxonomy
Topics: Sensor Technology and Measurement Systems · Photonic and Optical Devices
Methods: Attention with Linear Biases · Knowledge Distillation
