Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A. Smith, Mike Lewis

TL;DR
This paper introduces ALiBi, a simple and efficient positional bias method that enables transformer models to extrapolate to longer input sequences at inference time, improving performance and training efficiency.
Contribution
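The paper proposes ALiBi, a position method that adds no positional embeddings to the word embeddings. Instead, it biases query-key attention scores with a penalty proportional to the distance between query and key, which lets models generalize to sequences longer than those seen during training while improving training speed and memory efficiency.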
Findings
A 1.3 billion parameter ALiBi model trained on input sequences of length 1024 extrapolates to sequences of length 2048, twice the training length.
ALiBi trains faster and uses less memory than sinusoidal embeddings.
ALiBi outperforms other position methods on WikiText-103.
Abstract
Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position…
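The biasing step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the causal masking is an assumption about the language-modeling setup, and the geometric per-head slope schedule comes from the full paper rather than from the abstract shown here, so treat it as an assumption.

```python
import math

def alibi_slopes(n_heads):
    # Assumed per-head slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ...
    # This schedule is only intended for n_heads that are a power of two.
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for keys at or before the query (j <= i).
    # Future keys get -inf, which acts as a causal mask (an assumption here).
    return [[-slope * (i - j) if j <= i else -math.inf
             for j in range(seq_len)]
            for i in range(seq_len)]

def biased_attention_weights(scores, slope):
    # scores: square list of lists holding raw query-key attention scores.
    # The linear bias is added to the scores before the softmax; nothing is
    # added to the word embeddings.
    n = len(scores)
    bias = alibi_bias(n, slope)
    weights = []
    for i in range(n):
        row = [scores[i][j] + bias[i][j] for j in range(n)]
        row_max = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - row_max) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With all raw scores equal, the weights in each row decay with distance from the query, so nearer keys receive more attention. Because the bias depends only on the distance i - j, the same function can be called with a sequence length larger than any seen in training.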
Code & Models
- 🤗 bigscience/bloom · model · 7.4k dl · ♡ 4989
- 🤗 baichuan-inc/Baichuan-13B-Chat · model · 8.8k dl · ♡ 633
- 🤗 Hum-Works/lodestone-base-4096-v1 · model · 112 dl · ♡ 12
- 🤗 bigscience/bloom-560m · model · 192k dl · ♡ 371
- 🤗 bigscience/bloom-1b1 · model · 6.6k dl · ♡ 66
- 🤗 bigscience/bloom-1b7 · model · 55k dl · ♡ 122
- 🤗 bigscience/bloom-3b · model · 10k dl · ♡ 94
- 🤗 bigscience/bloom-7b1 · model · 11k dl · ♡ 202
- 🤗 bigscience/bloom-intermediate · model · 12 dl · ♡ 12
- 🤗 bigscience/distill-bloom-1b3 · model · 46 dl · ♡ 9
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention with Linear Biases
