A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts

Suyu Ge; Xihui Lin; Yunan Zhang; Jiawei Han; Hao Peng

arXiv:2410.01485·cs.CL·December 6, 2024

TL;DR

This paper introduces LongGen, a method for efficient long-context training and inference in LLMs that integrates length extension with GPU-friendly sparse attention architectures, yielding a 1.55x training speedup on 128K contexts and a 62% reduction in KV cache memory during inference.

Contribution

LongGen combines length extension with a hybrid sparse attention architecture, enabling efficient long-context LLM training and inference with minimal additional training.

Findings

1. Achieves 1.55x training speedup on 128K contexts

2. Reduces KV cache memory by 62% during inference

3. Demonstrates effectiveness on Llama-2 models of different scales
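To give a sense of scale for the KV cache finding, the following is a back-of-the-envelope sketch, not the paper's measurement. It assumes a Llama-2-7B-like shape (32 layers, 32 key/value heads, head dimension 128), fp16 storage, batch size 1, and a 128K-token context at inference; none of these settings are stated in this summary, and `kv_cache_bytes` is a hypothetical helper introduced only for illustration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a dense KV cache for one sequence, in bytes."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Assumed model shape (Llama-2-7B-like) and assumed context length of 128K tokens.
full_bytes = kv_cache_bytes(num_tokens=128 * 1024, num_layers=32,
                            num_kv_heads=32, head_dim=128)

# Apply the reported 62% reduction as a flat scaling factor (an illustration only).
reduced_bytes = full_bytes * (1 - 0.62)

print(f"dense cache:   {full_bytes / GIB:.2f} GiB")
print(f"reduced cache: {reduced_bytes / GIB:.2f} GiB")
```

Under these assumptions the dense cache is 64 GiB for a single sequence, so a 62% reduction would bring it to roughly 24 GiB.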

Abstract

Training and serving long-context large language models (LLMs) incur substantial overhead. To address this, two critical steps are often required: a pretrained LLM typically undergoes a separate stage for context length extension by training on long-context data, followed by architectural modifications to reduce the overhead of KV cache during serving. This paper argues that integrating length extension with a GPU-friendly KV cache reduction architecture not only reduces training overhead during length extension, but also achieves better long-context performance. This leads to our proposed LongGen, which finetunes a pretrained LLM into an efficient architecture during length extension. LongGen builds on three key insights: (1) Sparse attention patterns, such as window attention (attending to recent tokens), attention sink (initial ones), and blockwise sparse attention (strided token…
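Two of the sparse patterns named in the abstract, window attention and attention sink, can be illustrated with a small causal mask. This is a minimal sketch in pure Python, not LongGen's implementation: the function names and the parameters `window` and `num_sink` are hypothetical, and blockwise sparse attention is left out because its description is truncated here.

```python
def sparse_causal_mask(seq_len, window, num_sink):
    """Return mask[q][k] == True where query position q may attend to key position k.

    A key is visible if it is causal (k <= q) and is either one of the
    `window` most recent tokens or one of the first `num_sink` tokens.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(q + 1):           # causal: current and earlier tokens only
            in_window = (q - k) < window  # window attention: recent tokens
            is_sink = k < num_sink        # attention sink: initial tokens
            mask[q][k] = in_window or is_sink
    return mask


def attended_keys(mask, q):
    """List the key positions that query q attends to."""
    return [k for k, allowed in enumerate(mask[q]) if allowed]


mask = sparse_causal_mask(seq_len=8, window=3, num_sink=1)
print(attended_keys(mask, 7))  # the sink token plus the last three positions
```

With this pattern each query keeps at most `window + num_sink` keys, so the keys that must stay cached no longer grow with the sequence length.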

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need