Towards Infinite Length Extrapolation: A Unified Approach
Nitin Vetcha

TL;DR
This paper introduces a unified framework for positional encoding in large language models, proposes Adaptive Positional Encoding (APE) for better long-range dependency handling, and demonstrates its effectiveness on datasets with extremely long sequences.
Contribution
It presents a unified reinterpretation of positional encoding methods, introduces APE with adaptive frequency modulation, and provides theoretical conditions for infinite-length extrapolation.
Findings
APE enables models to process sequences up to 32,000 words.
The framework unifies and generalizes existing positional encoding methods.
Theoretical analysis guarantees well-defined normalization over unbounded sequences.
Abstract
Large language models (LLMs) have revolutionized natural language processing, but their ability to process long sequences is fundamentally limited by the context window size during training. Existing length extrapolation methods often suffer from performance degradation or computational inefficiencies. We therefore adopt a unified framework that reinterprets positional encoding methods as a decomposition of the attention score into a multiplicative transformation and an additive bias. This perspective not only subsumes popular approaches such as relative position embeddings and attention-bias moderated approaches but also exposes their inherent limitations in handling long-range dependencies. To address these shortcomings, motivated by our framework, we introduce Adaptive Positional Encoding (APE), which leverages adaptive frequency modulation and an intricately designed decay bias that…
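The decomposition described in the abstract, an attention score split into a multiplicative transformation and an additive bias, can be illustrated with a minimal sketch. It is not the paper's APE formulation, which is not given in this summary. As an assumption, it uses two widely known schemes as stand-ins: a rotary-style rotation for the multiplicative term and an ALiBi-style linear decay for the additive term. All function names, the frequency base, and the slope value are hypothetical choices made for illustration.

```python
import math

def rotate(vec, angles):
    """Rotate each consecutive 2-D pair of vec by the matching angle."""
    out = []
    for p, angle in enumerate(angles):
        x, y = vec[2 * p], vec[2 * p + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rotary_transform(k, i, j, base=10000.0):
    """Multiplicative term (assumed rotary-style): rotate the key by offset j - i."""
    d = len(k)
    freqs = [base ** (-2 * p / d) for p in range(d // 2)]
    return rotate(k, [(j - i) * f for f in freqs])

def identity_transform(k, i, j):
    """No multiplicative position effect."""
    return list(k)

def zero_bias(i, j):
    """No additive position effect."""
    return 0.0

def linear_decay_bias(i, j, slope=0.5):
    """Additive term (assumed ALiBi-style): penalty grows linearly with distance."""
    return -slope * abs(i - j)

def attention_score(q, k, i, j, transform, bias):
    """Unified form: score = q . transform(k, i, j) + bias(i, j)."""
    k_t = transform(k, i, j)
    return sum(a * b for a, b in zip(q, k_t)) + bias(i, j)

q = [1.0, 0.0, 0.0, 1.0]
k = [1.0, 0.0, 0.0, 1.0]

# Rotary-style: multiplicative term only, depends on the offset j - i.
rotary_near = attention_score(q, k, 2, 5, rotary_transform, zero_bias)
rotary_far = attention_score(q, k, 10, 13, rotary_transform, zero_bias)

# ALiBi-style: additive term only.
alibi = attention_score(q, k, 3, 1, identity_transform, linear_decay_bias)
```

Under this reading, each scheme is a particular choice of the `transform` and `bias` arguments, and a method combining both effects would supply non-trivial functions for each.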
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
