LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models
Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang

TL;DR
LM-Infinite lets large language models process extremely long contexts (up to 200 million tokens) without retraining, which broadens their applicability to long-text tasks.
Contribution
The paper introduces LM-Infinite, a simple, flexible, and parameter-free approach that improves LLMs' handling of ultra-long inputs, addressing length generalization failures that common techniques such as attention-window truncation and relative positional encodings do not resolve.
Findings
Enables models trained on 2K-4K-token segments to process inputs of up to 200M tokens.
Achieves 2.7x decoding speedup and 7.5x memory reduction.
Improves zero-shot performance on tasks like Passkey Retrieval and Qasper.
Abstract
Today's large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encoding scientific articles, code repositories, or long dialogues. Through theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis further reveals that commonly used techniques like truncating the attention window or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing
