Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao; Yuandong Tian; Beidi Chen; Song Han; Mike Lewis

arXiv:2309.17453·cs.CL·April 9, 2024·32 cites

TL;DR

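This paper introduces StreamingLLM, a framework that lets large language models handle effectively unbounded sequences in streaming applications. It keeps the KV states of a few initial tokens (the attention sinks) in the cache alongside the most recent tokens, which keeps language modeling stable over millions of tokens and yields up to a 22.2x speedup over sliding window with re-computation.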

Contribution

The paper proposes StreamingLLM, a new method that allows LLMs trained on finite sequences to generalize to infinite sequences without fine-tuning, using the concept of attention sinks.

Findings

01

StreamingLLM enables stable language modeling with up to 4 million tokens.

02

It achieves up to a 22.2x speedup over the sliding-window-with-re-computation baseline.

03

Adding a placeholder token as an attention sink further improves streaming performance.

Abstract

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important.…
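The cache policy described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `SinkWindowCache`, its parameters `n_sink` and `window`, and the use of plain integers in place of real Key/Value tensors are all assumptions made for illustration. It shows only the eviction rule: the first few entries are never evicted, and the remaining slots hold the most recent entries.

```python
from collections import deque


class SinkWindowCache:
    """Toy KV cache (hypothetical, for illustration only).

    Keeps the first `n_sink` entries permanently (the "attention sinks")
    and, in addition, only the `window` most recent entries.
    With n_sink=0 it reduces to plain window attention caching.
    """

    def __init__(self, n_sink, window):
        self.n_sink = n_sink
        self.sinks = []                      # initial tokens, never evicted
        self.recent = deque(maxlen=window)   # rolling window of recent tokens

    def append(self, kv):
        # The first n_sink entries fill the sink slots; every later entry
        # goes into the rolling window, where a full deque silently drops
        # its oldest element.
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def contents(self):
        # Entries the model could attend to at the current step.
        return self.sinks + list(self.recent)

    def __len__(self):
        return len(self.sinks) + len(self.recent)


# Integers stand in for per-token KV states.
cache = SinkWindowCache(n_sink=2, window=3)
for token_index in range(10):
    cache.append(token_index)
print(cache.contents())  # [0, 1, 7, 8, 9]
```

The cache never holds more than `n_sink + window` entries, however many tokens are streamed through it, which is the property that bounds memory during decoding. The values 2 and 3 above are arbitrary and chosen only to keep the output short.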

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 5

Strengths

The paper is well written, the problem is well motivated with many experiments used to help make a convincing claim. All code and datasets are made available, allowing for easy reproducibility and validation of claims made in this work.

Weaknesses

The claims about long-horizon performance need additional clarity and evaluation. I am not sure how practically useful this approach is for long-horizon language modelling. For example, at million-scale sequence lengths, when intermediate tokens are evicted, how does performance translate relative to the evicted tokens? Doesn't this mean the model is only practically useful for information at the beginning and at the end of the sequence? There is a missing section on the broader societal im…

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

- Several methods have been introduced in the past for generalizing to longer context lengths, but most of them require training the model from scratch or continued training. The paper's observation of the attention sink phenomenon is fascinating and inspires a simple solution, overlooked by prior work, that requires no additional training.
- The paper is positioned well with respect to prior length generalization work and shows how different relative positional embedding m…

Weaknesses

- The proposed attention is reminiscent of methods that introduce sparse attention patterns, such as Sparse Transformer, but there is no in-depth discussion that draws the connection.
- Even though attention sinks keep perplexity in check for extremely long sequences, the paper does not study in detail how well the model actually utilizes the context.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

Improving the efficiency of language modeling with long text input is a practically important and very timely topic. The paper is well motivated and structured in that it first introduces problems, suggests an easy and effective solution, and validates it with experiments and required ablation studies.

Weaknesses

StreamingLLM is compared only against very basic baselines, namely (a) dense attention, (b) window attention, and (c) sliding window with re-computation, and not against alternative ways of using LLMs with longer inputs. The authors report a 22.2x speedup over (c), but this looks too extreme: a sliding window with an appropriate stride (re-computing less frequently than at every step) could hit a sweet spot with reasonable perplexity and inference speed. The paper argues…

Code & Models

Videos

Efficient Streaming Language Models with Attention Sinks (Paper Explained) (YouTube)

StreamingLLM Demo (YouTube)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Sinks · Pythia