Contextual Position Encoding: Learning to Count What's Important
Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar

TL;DR
This paper introduces Contextual Position Encoding (CoPE), a novel method enabling language models to condition position information on context, improving their ability to handle abstract and higher-level positional tasks beyond simple token counts.
Contribution
The paper proposes CoPE, a new position encoding technique that allows context-dependent position conditioning, enhancing model generalization to complex positional tasks.
Findings
CoPE outperforms traditional PE methods on counting and selective copy tasks.
CoPE improves perplexity in language modeling and coding tasks.
CoPE enables attending to higher-level structures like sentences or nouns.
Abstract
The attention mechanism is a critical component of Large Language Models (LLMs) that allows tokens in a sequence to interact with each other, but is order-invariant. Incorporating position encoding (PE) makes it possible to address by position, such as attending to the i-th token. However, current PE methods use token counts to derive position, and thus cannot generalize to higher levels of abstraction, such as attending to the i-th sentence. In this paper, we propose a new position encoding method, Contextual Position Encoding (CoPE), that allows positions to be conditioned on context by incrementing position only on certain tokens determined by the model. This allows more general position addressing such as attending to the i-th particular word, noun, or sentence. We show that CoPE can solve the selective copy, counting and Flip-Flop tasks where popular position embeddings fail, and…
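The idea of "incrementing position only on certain tokens" can be illustrated with a small sketch. This is not the authors' implementation: the function name `contextual_positions`, the toy token list, and the hand-set 0/1 gates are assumptions made for illustration. In the method itself the gate values are determined by the model from context, not fixed by a rule. The sketch takes one query (the last token) and measures the position of every earlier token as the sum of the gates between that token and the query.

```python
def contextual_positions(gates):
    """Return, for each index j, the sum of gates[j], ..., gates[-1].

    The last token plays the role of the query. A gate of 1 means
    "count this token"; a gate of 0 means "skip it".
    """
    positions = [0.0] * len(gates)
    running = 0.0
    # Walk backwards from the query so each token accumulates the
    # gates of everything between itself and the query (inclusive).
    for j in range(len(gates) - 1, -1, -1):
        running += gates[j]
        positions[j] = running
    return positions


tokens = ["A", "cat", ".", "It", "sat", ".", "Then", "left"]

# Hypothetical hard gates (an assumption): count only sentence-ending tokens.
sentence_gates = [1.0 if t == "." else 0.0 for t in tokens]
sentence_pos = contextual_positions(sentence_gates)

# With every gate open, the positions reduce to ordinary token counts.
token_pos = contextual_positions([1.0] * len(tokens))

print(sentence_pos)  # [2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(token_pos)     # [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
```

With the sentence gates, all tokens in the same sentence share one position value, so a query can address "the sentence before the current one" directly. With all gates open, the same computation gives token-count-based relative positions.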
Taxonomy
Topics: Speech and dialogue systems
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
