A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts
Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng

TL;DR
This paper introduces LongGen, a method for efficient long-context training and inference in LLMs by integrating length extension with GPU-friendly sparse attention architectures, resulting in significant speedups and memory savings.
Contribution
LongGen combines length extension with a hybrid sparse attention architecture, enabling efficient long-context LLM training and inference with minimal additional training.
Findings
Achieves 1.55x training speedup on 128K contexts
Reduces KV cache memory by 62% during inference
Demonstrates effectiveness on Llama-2 models of different scales
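To give a sense of scale for the KV cache finding, the sketch below estimates the cache size of a dense full-attention baseline and applies the reported 62% reduction to it. Everything except the 62% and the 128K context length is an assumption for illustration: the configuration is Llama-2-7B-like (32 layers, 32 KV heads, head dimension 128), values are stored in 16-bit precision, 128K is taken as 131,072 tokens, and the reduction is applied to a single sequence. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a dense KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed Llama-2-7B-like shape; not taken from the summary above.
dense = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=128 * 1024)

# Apply the reported 62% reduction to the dense baseline (an assumption
# about what the figure is measured against).
reduced = dense * (1 - 0.62)

print(f"dense:   {dense / 2**30:.2f} GiB")
print(f"reduced: {reduced / 2**30:.2f} GiB")
```

Under these assumptions the dense cache for one 128K-token sequence is 64 GiB, which is why cache reduction matters for serving at this length.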
Abstract
Training and serving long-context large language models (LLMs) incur substantial overhead. To address this, two critical steps are often required: a pretrained LLM typically undergoes a separate stage for context length extension by training on long-context data, followed by architectural modifications to reduce the overhead of KV cache during serving. This paper argues that integrating length extension with a GPU-friendly KV cache reduction architecture not only reduces training overhead during length extension, but also achieves better long-context performance. This leads to our proposed LongGen, which finetunes a pretrained LLM into an efficient architecture during length extension. LongGen builds on three key insights: (1) Sparse attention patterns, such as window attention (attending to recent tokens), attention sink (initial ones), and blockwise sparse attention (strided token…
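Two of the sparse patterns named in the abstract, window attention and attention sink, can be illustrated with a small mask. This is a minimal NumPy sketch, not LongGen's implementation: the function name `sink_window_mask` and the parameters `num_sink` and `window` are hypothetical, and the blockwise sparse pattern is left out because its description is cut off above.

```python
import numpy as np


def sink_window_mask(seq_len: int, num_sink: int, window: int) -> np.ndarray:
    """Boolean mask where entry [q, k] is True if query q may attend to key k.

    A query sees the first `num_sink` tokens (attention sink) and the
    `window` most recent tokens including itself (window attention),
    always subject to causality.
    """
    q = np.arange(seq_len)[:, None]  # query positions as a column
    k = np.arange(seq_len)[None, :]  # key positions as a row
    causal = k <= q                  # never attend to future tokens
    in_window = (q - k) < window     # recent tokens only
    is_sink = k < num_sink           # initial tokens are always visible
    return causal & (in_window | is_sink)


mask = sink_window_mask(seq_len=8, num_sink=1, window=3)
print(mask.astype(int))

# Compare the number of attended pairs against dense causal attention.
dense_pairs = 8 * (8 + 1) // 2
print(int(mask.sum()), "of", dense_pairs, "causal pairs kept")
```

With a fixed window and sink size, each query attends to at most `num_sink + window` keys, so the number of cached keys a layer needs stays constant as the sequence grows instead of scaling with its length.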
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need
