Functional Interpolation for Relative Positions Improves Long Context Transformers

Shanda Li; Chong You; Guru Guruganesh; Joshua Ainslie; Santiago Ontanon; Manzil Zaheer; Sumit Sanghai; Yiming Yang; Sanjiv Kumar; Srinadh Bhojanapalli

arXiv:2310.04418·cs.LG·March 5, 2024·2 cites

Open Access · 1 model

TL;DR

This paper introduces FIRE, a relative position encoding for Transformers that improves generalization to inputs longer than those seen during training, supported by a theoretical representation result and empirical evaluation.

Contribution

FIRE is a functional relative position encoding with progressive interpolation. FIRE models generalize better to longer contexts on both zero-shot language modeling and long text benchmarks.

Findings

01. FIRE can provably represent some popular relative position encodings, such as T5's RPE, Alibi, and Kerple.

02. FIRE models generalize better to longer contexts in zero-shot language modeling.

03. FIRE models also generalize better to longer contexts on long text benchmarks.

Abstract

Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
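The abstract does not spell out the encoding itself, so the following is only a minimal NumPy sketch of how a functional relative position bias with progressive interpolation could look. It assumes the form b(i, j) = f(ψ(i − j) / ψ(max(L, i))) with ψ(x) = log(cx + 1), which is our reading of the paper and is not stated on this page. The function names, the one-hidden-layer MLP, its width, and the default values of `c` and `threshold` are all hypothetical choices made for illustration.

```python
import numpy as np

def fire_normalized_distance(seq_len, c=1.0, threshold=16.0):
    """Assumed FIRE-style input: psi(i - j) / psi(max(threshold, i))."""
    pos = np.arange(seq_len, dtype=np.float64)
    # Relative distance i - j (query i, key j); future keys are clamped to 0.
    rel = np.maximum(pos[:, None] - pos[None, :], 0.0)
    # Monotone log transform (assumed form): psi(x) = log(c * x + 1).
    psi = lambda x: np.log(c * x + 1.0)
    # Progressive interpolation: normalize by the query position, floored at
    # `threshold` so that very short prefixes are not stretched to [0, 1].
    denom = psi(np.maximum(pos, threshold))[:, None]
    return psi(rel) / denom

def fire_bias(seq_len, w1, b1, w2, b2, c=1.0, threshold=16.0):
    """Map each normalized distance through a tiny scalar-to-scalar MLP."""
    x = fire_normalized_distance(seq_len, c, threshold)   # (n, n)
    hidden = np.maximum(x[..., None] * w1 + b1, 0.0)      # (n, n, width), ReLU
    return hidden @ w2 + b2                                # (n, n)

# Hypothetical random parameters standing in for learned weights.
rng = np.random.default_rng(0)
width = 8
w1, b1, w2 = (rng.normal(size=width) for _ in range(3))
bias = fire_bias(128, w1, b1, w2, b2=0.0)
print(bias.shape)  # (128, 128)
```

In a Transformer, a bias matrix like this would be added to the attention logits before the softmax, typically with separate parameters per head. Under these assumptions the MLP input stays in [0, 1] for any sequence length, which is one plausible reason an encoding of this kind can be evaluated on inputs longer than those used in training.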

Code & Models

Models

CATIE-AQ/FAT5-small (Hugging Face model · 5 downloads · 2 likes)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection · Layer Normalization · Linear Layer · Multi-Head Attention