Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Ofir Press; Noah A. Smith; Mike Lewis

arXiv:2108.12409 · cs.CL · April 26, 2022 · 30 cites

TL;DR

This paper introduces ALiBi (Attention with Linear Biases), a simple and efficient position method that lets transformer models extrapolate at inference time to input sequences longer than those seen during training, while training faster and using less memory.

Contribution

The paper proposes ALiBi, which adds no positional embeddings to the word embeddings and instead penalizes each query-key attention score in proportion to the distance between the query and the key. This lets models generalize to longer sequences while improving training speed and memory efficiency.

Findings

01

A 1.3 billion parameter ALiBi model trained on input sequences of length 1024 extrapolates to input sequences of length 2048, twice the training length.

02

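Trained on length-1024 inputs, ALiBi matches the perplexity of a sinusoidal position embedding model trained on length-2048 inputs while training 11% faster and using 11% less memory.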

03

ALiBi outperforms multiple strong position methods on the WikiText-103 benchmark.

Abstract

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position…
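
The biasing step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the per-head slope schedule (a geometric sequence starting at 2^(-8/n) for n heads) is taken from the full paper rather than from this summary and is assumed to apply when n is a power of two, and a causal mask is assumed because the paper studies language modeling.

```python
import math

def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumption: num_heads is a power of two, as in the schedule
    described in the full paper.
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Bias matrix with rows as queries (i) and columns as keys (j).

    Past and current keys (j <= i) get the penalty -slope * (i - j);
    future keys get -inf, which acts as the causal mask.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, slope):
    """Add the linear bias to raw query-key scores, then softmax each row."""
    bias = alibi_bias(len(scores), slope)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]

# Usage: one head of an 8-head model, with uniform raw scores so that
# only the distance penalty shapes the attention weights.
slopes = alibi_slopes(8)
uniform_scores = [[0.0] * 4 for _ in range(4)]
weights = biased_attention_weights(uniform_scores, slopes[0])
```

Because the bias depends only on the distance i - j and a fixed slope, `alibi_bias` can be built for any sequence length at inference time with no learned position parameters, which is what makes evaluating on inputs longer than the training length possible in this sketch.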

Code & Models

Videos

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention with Linear Biases