Training-Free Long-Context Scaling of Large Language Models
Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, Lingpeng Kong

TL;DR
This paper introduces Dual Chunk Attention, a training-free method enabling large language models like Llama2 70B to handle over 100k token contexts efficiently, matching or surpassing finetuned models on practical tasks.
Contribution
The paper presents Dual Chunk Attention, a novel attention mechanism that allows long-context processing in LLMs without additional training, improving scalability and performance.
Findings
Supports context windows of more than 100k tokens without continual training
Achieves performance comparable to or better than finetuned models on practical long-context tasks
Reaches 94% of GPT-3.5-16k performance
Abstract
The ability of Large Language Models (LLMs) to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), as well as integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with…
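The chunk-based decomposition described in the abstract can be illustrated with a small sketch of how relative positions might be assigned. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `dca_relative_positions`, the parameters `chunk_size` and `max_pos`, and the choice of giving inter-chunk queries the fixed index `max_pos - 1` are all assumptions made for the example, and any further refinements in the full method are omitted.

```python
def dca_relative_positions(seq_len, chunk_size, max_pos):
    """Sketch of chunked relative positions for causal attention.

    Returns a seq_len x seq_len matrix M where M[i][j] is the relative
    position assumed for query i attending to key j (j <= i), or None
    where the causal mask blocks attention (j > i).

    Assumptions (hypothetical, for illustration only):
      - max_pos is the pretraining context length
      - chunk_size <= max_pos
      - inter-chunk queries use the fixed position index max_pos - 1
    """
    if not 0 < chunk_size <= max_pos:
        raise ValueError("chunk_size must be in (0, max_pos]")

    matrix = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if j > i:
                # Future token: masked out under causal attention.
                row.append(None)
            elif i // chunk_size == j // chunk_size:
                # Intra-chunk: both positions are taken modulo chunk_size,
                # so the ordinary relative distance is preserved.
                row.append((i % chunk_size) - (j % chunk_size))
            else:
                # Inter-chunk: the query is assigned a fixed large index so
                # the distance stays within [0, max_pos - 1].
                row.append((max_pos - 1) - (j % chunk_size))
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in dca_relative_positions(seq_len=8, chunk_size=4, max_pos=6):
        print(row)
```

In this sketch every unmasked entry stays between 0 and `max_pos - 1` regardless of `seq_len`, which is the property that lets a model see only relative distances it encountered during pretraining.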
Code & Models
- 🤗 Qwen/Qwen3-30B-A3B-Instruct-2507 (model · 1.0M dl · ♡ 795)
- 🤗 Qwen/Qwen3-235B-A22B-Instruct-2507 (model · 178k dl · ♡ 770)
- 🤗 ServiceNow-AI/Apriel-1.6-15b-Thinker (model · 1.7k dl · ♡ 296)
- 🤗 Qwen/Qwen3-235B-A22B-Thinking-2507 (model · 78k dl · ♡ 403)
- 🤗 Qwen/Qwen3-30B-A3B-Thinking-2507 (model · 1.0M dl · ♡ 371)
- 🤗 AIDXteam/Qwen3-235B-A22B-Thinking-2507-AWQ (model · 4 dl)
- 🤗 AmirHaz/Affine-yollloooo (model · 19 dl)
- 🤗 Mungert/Qwen3-30B-A3B-Thinking-2507-GGUF (model · 234 dl)
- 🤗 Mungert/Qwen3-30B-A3B-Instruct-2507-GGUF (model · 145 dl · ♡ 2)
- 🤗 Intellicia/Sullivan (model)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
