Rethinking Positional Encoding in Language Pre-training
Guolin Ke, Di He, Tie-Yan Liu

TL;DR
This paper critically examines existing positional encoding methods in language pre-training, identifies their limitations, and introduces TUPE, a novel encoding approach that separates positional and word correlations, improving model expressiveness and performance.
Contribution
The paper proposes TUPE, a positional encoding method that unties the position of the [CLS] token from the positions of other tokens and computes word and positional correlations separately, improving the expressiveness and performance of language models.
Findings
TUPE outperforms existing methods on the GLUE benchmark.
Separating positional and word correlations improves model expressiveness.
Untying [CLS] enhances sentence-level representation.
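The two design ideas listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the separate projection matrices U_q and U_k, the sqrt(2d) scaling, and the two scalars standing in for the [CLS] positional scores are assumptions made for illustration, and in a trained model all of them would be learned.

```python
import numpy as np

def untied_attention_logits(w, p, W_q, W_k, U_q, U_k,
                            theta_from_cls=0.0, theta_to_cls=0.0):
    """Hypothetical sketch of attention logits with untied positions.

    w: (n, d) word embeddings, with [CLS] assumed at index 0
    p: (n, d) absolute positional embeddings
    W_q, W_k: (d, d) projections used only for the word term
    U_q, U_k: (d, d) projections used only for the positional term
    theta_from_cls, theta_to_cls: scalars standing in for learned values
    """
    d = w.shape[1]
    scale = np.sqrt(2 * d)  # assumed scaling for the sum of two terms

    # Word-to-word and position-to-position correlations are computed
    # separately, each with its own projection matrices.
    word_term = (w @ W_q) @ (w @ W_k).T / scale
    pos_term = (p @ U_q) @ (p @ U_k).T / scale

    # Untie [CLS]: overwrite positional scores that involve index 0 so
    # that [CLS] is not treated like an ordinary first position.
    pos_term[1:, 0] = theta_to_cls   # other tokens attending to [CLS]
    pos_term[0, :] = theta_from_cls  # [CLS] attending to every token

    return word_term + pos_term
```

A softmax over each row of the returned matrix would give the attention weights. Keeping the two terms separate means no word-to-position cross terms enter the logits.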
Abstract
In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information resources. It may bring unnecessary randomness in the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol [CLS] the same as other words is a reasonable design, considering its special role (the representation of the entire sentence) in the downstream tasks. Motivated from above analysis, we propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE). In the self-attention…
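The "mixed correlations" mentioned in the abstract can be checked numerically. The sketch below assumes standard scaled dot-product attention with a single head; the sizes, random values, and variable names are hypothetical. It shows that adding positional embeddings to word embeddings before the query and key projections yields a logit matrix that is the sum of four terms, two of which pair a word with a position.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8  # hypothetical sequence length and hidden size

w = rng.normal(size=(n, d))    # word embeddings
p = rng.normal(size=(n, d))    # absolute positional embeddings
W_q = rng.normal(size=(d, d))  # query projection
W_k = rng.normal(size=(d, d))  # key projection

def logits(a, b):
    """Scaled dot-product logits with queries from a and keys from b."""
    return (a @ W_q) @ (b @ W_k).T / np.sqrt(d)

# Logits when the embeddings are summed before projection.
mixed = logits(w + p, w + p)

# The same logits split into their four bilinear terms.
word_word = logits(w, w)
word_pos = logits(w, p)  # word query against position key
pos_word = logits(p, w)  # position query against word key
pos_pos = logits(p, p)

assert np.allclose(mixed, word_word + word_pos + pos_word + pos_pos)
```

The word_pos and pos_word terms are the cross correlations between the two heterogeneous information sources that the abstract refers to.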
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Weight Decay · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · Residual Connection · Label Smoothing
