Rethinking Positional Encoding in Language Pre-training

Guolin Ke; Di He; Tie-Yan Liu

arXiv:2006.15595·cs.CL·March 16, 2021·68 cites

TL;DR

This paper critically examines existing positional encoding methods in language pre-training, identifies their limitations, and introduces TUPE, a novel encoding approach that separates positional and word correlations, improving model expressiveness and performance.

Contribution

The paper proposes TUPE, a positional encoding method that unties the [CLS] token from the other positions and computes word correlations and positional correlations separately, with the aim of improving expressiveness and performance in language models.

Findings

01. TUPE outperforms existing methods on the GLUE benchmark.

02. Separating positional and word correlations improves model expressiveness.

03. Untying [CLS] enhances sentence-level representation.

Abstract

In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information resources. It may bring unnecessary randomness in the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol [CLS] the same as other words is a reasonable design, considering its special role (the representation of the entire sentence) in the downstream tasks. Motivated by the above analysis, we propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE). In the self-attention…
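The "mixed correlations" mentioned in the abstract can be illustrated with a small numerical sketch. It assumes a single attention head, omits the scaling factor, and uses random toy vectors. All names (`w_i`, `p_i`, `WQ`, `WK`, `UQ`, `UK`) are hypothetical and chosen for illustration. The "untied" form, with separate projection matrices for positions, is one reading of the idea and is not spelled out in the truncated abstract above.

```python
import random

def matvec(x, W):
    """Row vector x (length d) times matrix W (d x d)."""
    return [sum(x[k] * W[k][j] for k in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def logit(q_in, k_in, WQ, WK):
    """Unscaled attention logit: (q_in WQ) . (k_in WK)."""
    return dot(matvec(q_in, WQ), matvec(k_in, WK))

random.seed(0)
d = 4

def rand_vec():
    return [random.gauss(0, 1) for _ in range(d)]

def rand_mat():
    return [rand_vec() for _ in range(d)]

w_i, w_j = rand_vec(), rand_vec()  # toy word embeddings
p_i, p_j = rand_vec(), rand_vec()  # toy positional embeddings
WQ, WK = rand_mat(), rand_mat()    # query/key projections
UQ, UK = rand_mat(), rand_mat()    # assumed separate projections for positions

# Absolute positional encoding: embeddings are added before projection,
# so the logit expands into four terms, two of which mix words and positions.
mixed = logit(add(w_i, p_i), add(w_j, p_j), WQ, WK)
terms = {
    "word-word": logit(w_i, w_j, WQ, WK),
    "word-position": logit(w_i, p_j, WQ, WK),
    "position-word": logit(p_i, w_j, WQ, WK),
    "position-position": logit(p_i, p_j, WQ, WK),
}
assert abs(mixed - sum(terms.values())) < 1e-9

# Untied sketch (assumption): keep only word-word and position-position
# terms, each computed with its own projection matrices.
untied = logit(w_i, w_j, WQ, WK) + logit(p_i, p_j, UQ, UK)

for name, value in terms.items():
    print(f"{name:18s} {value: .4f}")
print(f"{'untied':18s} {untied: .4f}")
```

The assertion checks that the additive formulation is exactly the sum of the four terms, which is why word-to-position and position-to-word correlations appear whenever the two embeddings are summed before projection.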

Code & Models

Videos

Rethinking Positional Encoding in Language Pre-training · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Weight Decay · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · Residual Connection · Label Smoothing