TL;DR
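This paper introduces StreamingLLM, a framework that lets large language models handle effectively unbounded sequences in streaming applications by retaining "attention sinks" (the KV states of the initial tokens) alongside a window of recent tokens. It reports stable language modeling up to 4 million tokens and up to 22.2x speedup over the sliding-window re-computation baseline.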
Contribution
The paper proposes StreamingLLM, a new method that allows LLMs trained on finite sequences to generalize to infinite sequences without fine-tuning, using the concept of attention sinks.
Findings
StreamingLLM enables stable language modeling with up to 4 million tokens.
It achieves up to 22.2x speedup over the sliding-window baseline with re-computation.
Adding a placeholder token as an attention sink further improves streaming performance.
Abstract
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important.…
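The cache policy described in the abstract can be sketched as a toy example: keep the entries of the first few tokens (the attention sinks) plus a rolling window of the most recent tokens, and evict everything in between. This is a minimal illustration, not the authors' implementation; the class name `SinkWindowCache`, its methods, and the sizes used are hypothetical, and the `kv` payload is a stand-in for real key/value tensors. Setting `n_sink=0` reduces it to plain window attention.

```python
from collections import deque


class SinkWindowCache:
    """Toy KV cache: keeps the first n_sink entries plus the n_recent newest.

    Hypothetical sketch for illustration only; names and sizes are
    assumptions, not taken from the StreamingLLM code base.
    """

    def __init__(self, n_sink=4, n_recent=8):
        self.n_sink = n_sink
        self.sink = []                          # initial tokens, never evicted
        self.recent = deque(maxlen=n_recent)    # rolling window of recent tokens

    def append(self, position, kv):
        """Store the KV entry for one token, evicting the oldest recent entry if full."""
        entry = (position, kv)
        if len(self.sink) < self.n_sink:
            self.sink.append(entry)
        else:
            # A deque with maxlen drops its oldest element automatically.
            self.recent.append(entry)

    def kept_positions(self):
        """Original text positions of the tokens still held in the cache."""
        return [p for p, _ in self.sink] + [p for p, _ in self.recent]


if __name__ == "__main__":
    with_sinks = SinkWindowCache(n_sink=2, n_recent=3)
    window_only = SinkWindowCache(n_sink=0, n_recent=3)
    for pos in range(10):
        with_sinks.append(pos, kv=None)
        window_only.append(pos, kv=None)
    print("sinks + window:", with_sinks.kept_positions())
    print("window only:   ", window_only.kept_positions())
```

In this sketch the cache size stays bounded at `n_sink + n_recent` no matter how long the stream runs, which is the property that keeps memory constant; the only difference from plain window attention is that the first few entries are never evicted.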
Peer Reviews
Decision: ICLR 2024 poster
The paper is well written, the problem is well motivated with many experiments used to help make a convincing claim. All code and datasets are made available, allowing for easy reproducibility and validation of claims made in this work.
The claims on long-horizon performance need additional clarity and evaluation. I am not sure how practically useful this approach is for long-horizon language modelling. For example, at million-scale sequence lengths, when intermediate tokens are evicted, how does performance translate relative to the evicted tokens? Doesn't this mean the model is only practically useful for information covered at the beginning of the sequence and at the end? There is a missing section on the broader societal im
- Several methods have been introduced in the past for generalizing to longer context lengths, but most of them require training the model from scratch or through continued training. The observation this paper makes about the attention sink phenomenon is fascinating and inspires a simple solution, overlooked by prior work, that requires no additional training.
- The paper is positioned well with respect to prior length generalization work and shows how different relative positional embedding m
- The proposed attention is reminiscent of methods that introduce sparse attention patterns, such as Sparse Transformer, but there is no in-depth discussion that draws a connection.
- Even though attention sinks keep perplexity in check for extremely long sequences, the paper does not study in detail how well the model utilizes the context.
Improving the efficiency of language modeling with long text input is a practically important and very timely topic. The paper is well motivated and structured in that it first introduces problems, suggests an easy and effective solution, and validates it with experiments and required ablation studies.
StreamingLLM is compared with only very basic baselines (i.e., (a) dense attention, (b) window attention, and (c) sliding window with re-computation), without alternative ways to use LLMs with longer inputs. The authors claim a 22.2x speedup for StreamingLLM over (c), but this looks too extreme, because there could be a sweet spot that achieves reasonable perplexity and inference speed by using a sliding window with appropriate strides (re-computing less frequently rather than at every step). The paper argues
Code & Models
- 🤗 NickyNicky/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2 · model · 746 dl · ♡ 11
- 🤗 NickyNicky/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v3 · model · 756 dl · ♡ 8
- 🤗 TheBloke/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-GGUF · model · 197 dl · ♡ 6
- 🤗 TheBloke/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-AWQ · model · 3 dl · ♡ 2
- 🤗 TheBloke/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-GPTQ · model · 7 dl · ♡ 5
- 🤗 LoneStriker/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-3.0bpw-h6-exl2 · model · 1 dl
- 🤗 LoneStriker/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-4.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-6.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-5.0bpw-h6-exl2 · model · 3 dl
- 🤗 LoneStriker/Mistral-7B-OpenOrca-oasst_top1_2023-08-25-v2-8.0bpw-h8-exl2 · model · 2 dl
Videos
Efficient Streaming Language Models with Attention Sinks (Paper Explained)· youtube
StreamingLLM Demo· youtube
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Sinks · Pythia
