Functional Interpolation for Relative Positions Improves Long Context Transformers
Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli

TL;DR
This paper introduces FIRE, a functional relative position encoding with progressive interpolation that helps Transformers generalize to inputs longer than those seen during training. The claim is backed by a theoretical representation result and by empirical evaluation.
Contribution
FIRE is a novel interpolation-based relative position encoding. In the paper's experiments, models trained with it generalize better to longer contexts on both zero-shot language modeling and long text benchmarks.
Findings
FIRE can represent popular relative position encodings like T5's RPE, Alibi, and Kerple.
FIRE models demonstrate improved long-context generalization in language modeling.
Empirical results show FIRE's effectiveness on long text benchmarks.
Abstract
Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
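The abstract does not spell out the encoding itself. As a rough illustration only, the sketch below follows our reading of the FIRE paper, where the attention bias for query position i and key position j is b(i, j) = f(psi(i - j) / psi(max(L, i))) with psi(x) = log(c*x + 1). The names `psi`, `normalized_distance`, `fire_bias`, and `toy_f` are hypothetical, `c` and `threshold` (L) are learnable in the paper but fixed here, and `toy_f` is a stand-in for the learned MLP, so this is not the authors' implementation.

```python
import math


def psi(x, c=1.0):
    """Monotone log transform log(c*x + 1); c > 0 is fixed here (assumed learnable in the paper)."""
    return math.log(c * x + 1.0)


def normalized_distance(i, j, c=1.0, threshold=0.0):
    """Progressively interpolated distance for query i and key j (1-indexed, causal: j <= i).

    Dividing by psi(max(threshold, i)) keeps the value in [0, 1) for any
    query position i, including positions beyond the training length.
    """
    if j < 1 or j > i:
        raise ValueError("expected 1 <= j <= i")
    return psi(i - j, c) / psi(max(threshold, i), c)


def toy_f(x):
    """Hypothetical stand-in for the learned MLP mapping [0, 1) to a bias value."""
    return -2.0 * x


def fire_bias(i, j, f=toy_f, c=1.0, threshold=0.0):
    """Attention bias added to the (i, j) logit under the assumptions above."""
    return f(normalized_distance(i, j, c, threshold))


if __name__ == "__main__":
    # The input to f stays bounded however long the sequence gets.
    for i in (8, 512, 100_000):
        print(i, round(normalized_distance(i, 1), 4), round(fire_bias(i, 1), 4))
```

Because the argument passed to `f` never leaves [0, 1), the function is only ever evaluated on a range it already saw during training, which is the intuition behind the long-context generalization claim. A positive `threshold` shrinks the normalized distance for short sequences instead of stretching small offsets across the whole interval.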
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization · Linear Layer · Multi-Head Attention
