Learning to Encode Position for Transformer with Continuous Dynamical Model
Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, Cho-Jui Hsieh

TL;DR
This paper proposes a learnable position encoding method for Transformers using Neural ODEs, enabling flexible, extrapolatable position representations that improve performance on language tasks.
Contribution
Introduces a position encoding for Transformers based on a continuous dynamical model: a learnable approach that extrapolates to longer sequences and addresses the limitations of both sinusoidal encodings and position embeddings.
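A minimal sketch may help make the idea of position evolving as a dynamical system concrete. It is not the authors' implementation: the function names (`make_dynamics`, `encode_positions`), the two-layer network, the random untrained weights, and the step sizes are all assumptions chosen for illustration, and a simple forward-Euler loop stands in for a proper Neural ODE solver.

```python
import numpy as np

def make_dynamics(dim, hidden, seed=0):
    """Build a small network h(t, p), a hypothetical stand-in for learnable dynamics."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.1, size=(dim + 1, hidden))  # untrained, illustrative weights
    w2 = rng.normal(scale=0.1, size=(hidden, dim))

    def h(t, p):
        x = np.concatenate([p, [t]])  # feed the time alongside the current state
        return np.tanh(x @ w1) @ w2

    return h

def encode_positions(h, p0, num_positions, dt=0.1, substeps=4):
    """Integrate dp/dt = h(t, p) with forward Euler, sampling one vector per position."""
    p = np.array(p0, dtype=float)
    t = 0.0
    step = dt / substeps
    encodings = []
    for _ in range(num_positions):
        for _ in range(substeps):
            p = p + step * h(t, p)
            t += step
        encodings.append(p.copy())
    return np.stack(encodings)

h = make_dynamics(dim=8, hidden=16)
short = encode_positions(h, np.zeros(8), num_positions=5)
long = encode_positions(h, np.zeros(8), num_positions=50)
```

Because each encoding is obtained by continuing the integration, nothing in the sketch fixes a maximum sequence length: `long` simply extends `short` with the same parameters, which is the extrapolation property the summary refers to.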
Findings
Consistent performance improvements on translation tasks
Enhanced flexibility and length extrapolation capabilities
Effective modeling of position evolution as a dynamical system
Abstract
We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models. Unlike RNN and LSTM, which contain inductive bias by loading the input tokens sequentially, non-recurrent models are less sensitive to position. The main reason is that position information among input units is not inherently encoded, i.e., the models are permutation equivalent; this problem justifies why all of the existing models are accompanied by a sinusoidal encoding/embedding layer at the input. However, this solution has clear limitations: the sinusoidal encoding is not flexible enough as it is manually designed and does not contain any learnable parameters, whereas the position embedding restricts the maximum length of input sequences. It is thus desirable to design a new position layer that contains learnable parameters to adjust to different datasets and…
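The two limitations named in the abstract can be illustrated with a short sketch. It assumes the standard sinusoidal formulation with base 10000 and an even encoding dimension; the lookup table is filled with random values as a stand-in for a trained position embedding, and `max_len` and `dim` are arbitrary illustrative sizes.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Fixed sinusoidal encoding: defined for any position, but has no learnable parameters."""
    positions = np.asarray(positions, dtype=float)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = positions / (10000.0 ** (2 * i / dim))
    enc = np.empty((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)  # even channels
    enc[:, 1::2] = np.cos(angles)  # odd channels
    return enc

max_len, dim = 16, 8
# Stand-in for a trained position embedding table: one row per position below max_len.
table = np.random.default_rng(0).normal(size=(max_len, dim))

far = sinusoidal_encoding([1000], dim)  # works for a position far beyond max_len

try:
    table[1000]  # the table has no row for this position
    lookup_ok = True
except IndexError:
    lookup_ok = False
```

The sinusoidal function accepts any position but cannot adapt to data, while the table can be trained but has no entry past `max_len`.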
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · 1x1 Convolution · Convolution · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing
