Long-Context Language Modeling with Parallel Context Encoding
Howard Yen, Tianyu Gao, Danqi Chen

TL;DR
This paper introduces CEPE, a framework that efficiently extends the context window of existing LLMs (to 128K tokens for LLAMA-2) and improves language modeling, in-context learning, and retrieval-augmented tasks without retraining the entire model.
Contribution
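The paper presents CEPE, a method that extends the context window of existing decoder-only LLMs. A small encoder processes long inputs chunk by chunk, and the frozen decoder reads the encoded chunks through cross-attention. The paper demonstrates the method's effectiveness on language modeling, in-context learning, and retrieval-augmented tasks.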
Findings
Trained on 8K-token documents, CEPE extends LLAMA-2's context window to 128K tokens, with 10x the throughput and only 1/6 of the memory.
CEPE improves performance on language modeling and in-context learning tasks.
A variant of CEPE extends the context window of instruction-tuned models using only unlabeled data.
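A rough count shows why encoding chunk by chunk is cheaper. One full self-attention layer over N tokens scores N × N query-key pairs, while attention restricted to independent chunks of length c scores about N × c pairs, a factor of N/c fewer. The sketch below counts both for one layer at 128K tokens. The chunk length of 256 and the helper names are assumptions for illustration. The count ignores model width, the decoder's cross-attention (which grows only linearly with context length), and the size gap between encoder and decoder, so it does not reproduce the reported throughput and memory figures, which were measured end to end.

```python
def full_attention_pairs(n_tokens):
    """Query-key pairs scored by one unmasked self-attention layer over all n_tokens."""
    return n_tokens * n_tokens


def chunked_attention_pairs(n_tokens, chunk_size):
    """Query-key pairs when self-attention only runs inside independent chunks."""
    full_chunks, remainder = divmod(n_tokens, chunk_size)
    # Each full chunk costs chunk_size^2; a shorter final chunk costs remainder^2.
    return full_chunks * chunk_size * chunk_size + remainder * remainder


n = 128 * 1024  # 128K context tokens
c = 256         # assumed chunk length, for illustration only
full = full_attention_pairs(n)
chunked = chunked_attention_pairs(n, c)
print(full, chunked, full // chunked)  # → 17179869184 33554432 512
```

The ratio is exactly N/c here because 128K divides evenly into 256-token chunks. The chunked cost grows linearly in N for a fixed c, while the full cost grows quadratically.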
Abstract
Extending large language models (LLMs) to process longer inputs is crucial for a wide range of applications. However, the substantial computational cost of transformers and limited generalization of positional encoding restrict the size of their context window. We introduce Context Expansion with Parallel Encoding (CEPE), a framework that can be applied to any existing decoder-only LLMs to extend their context window. CEPE employs a small encoder to process long inputs chunk by chunk, enabling the frozen decoder to utilize additional contexts via cross-attention. CEPE is efficient, generalizable, and versatile: trained with 8K-token documents, it extends the context window of LLAMA-2 to 128K tokens, offering 10x the throughput with only 1/6 of the memory. CEPE yields strong performance on language modeling and in-context learning. CEPE also excels in retrieval-augmented applications,…
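The following minimal pure-Python sketch makes the data flow described in the abstract concrete, assuming toy dimensions and random weights. A single attention layer (hypothetical `encode_chunks`) stands in for the small encoder and runs on each chunk independently, which is what allows chunks to be encoded in parallel. A single residual cross-attention step (hypothetical `decoder_cross_attention`) stands in for the cross-attention through which the frozen decoder reads every encoded chunk. This illustrates the structure only. It is not the authors' implementation, and it omits the decoder's own frozen layers, causal masking, multiple heads, and training.

```python
import math
import random


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, sources, W_q, W_k, W_v):
    """Single-head scaled dot-product attention of `queries` over `sources` (no mask)."""
    keys = [matvec(W_k, s) for s in sources]
    values = [matvec(W_v, s) for s in sources]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for x in queries:
        q = matvec(W_q, x)
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def encode_chunks(context, chunk_size, enc_weights):
    """Hypothetical encoder: unmasked self-attention inside each chunk only."""
    states = []
    for start in range(0, len(context), chunk_size):
        chunk = context[start:start + chunk_size]
        states.extend(attend(chunk, chunk, *enc_weights))  # chunks never see each other
    return states


def decoder_cross_attention(hidden, enc_states, xattn_weights):
    """Hypothetical residual cross-attention from decoder states to all encoder states."""
    update = attend(hidden, enc_states, *xattn_weights)
    return [[h + u for h, u in zip(hs, us)] for hs, us in zip(hidden, update)]


rng = random.Random(0)
d = 4
context = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(10)]  # 10 context token vectors
hidden = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(3)]    # 3 decoder positions
enc_weights = tuple(rand_matrix(rng, d, d) for _ in range(3))
xattn_weights = tuple(rand_matrix(rng, d, d) for _ in range(3))

enc_states = encode_chunks(context, chunk_size=4, enc_weights=enc_weights)
out = decoder_cross_attention(hidden, enc_states, xattn_weights)
print(len(enc_states), len(out), len(out[0]))  # → 10 3 4
```

Because each chunk is encoded on its own, changing a token in one chunk leaves the encoder states of every other chunk untouched, while the decoder still attends over all of them.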
