CLEX: Continuous Length Extrapolation for Large Language Models
Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing

TL;DR
CLEX is a continuous length extrapolation method for LLMs that models position embedding scaling as ordinary differential equations over the length scaling factor, extending context windows beyond the training length with minimal performance loss.
Contribution
The paper proposes CLEX, a continuous length extrapolation technique that generalizes position embedding scaling via differential equations, allowing LLMs to handle much longer contexts beyond training lengths.
Findings
CLEX extends the context window to over 4x or nearly 8x the training length.
CLEX maintains performance without deterioration when extrapolating to longer contexts.
CLEX achieves competitive results on LongBench with models trained on 4k length.
Abstract
Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks; however, their exceptional capabilities are restricted to the preset context window of the Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, either demonstrate notable limitations in their extrapolation abilities or sacrifice partial performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming…
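The idea of treating PE scaling as continuous dynamics over the length scaling factor can be illustrated with a small sketch. This is not the authors' implementation: it assumes rotary position embeddings with base 10000, treats the log-frequencies as a state z(t) that evolves with the scaling factor t, and integrates dz/dt with a fixed-step Euler solver. The function names and the dynamics `toy_dynamics` are hypothetical; the dynamics are hand-picked as dz/dt = -1/t so that the result reduces to plain position interpolation, where every frequency is divided by t.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """Standard rotary frequencies base^(-2i/d) for i = 0 .. d/2 - 1."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def toy_dynamics(z, t):
    """Hypothetical stand-in for the dynamics dz/dt (an assumption, not from the paper).

    With dz/dt = -1/t, the exact solution is z(t) = z(1) - log(t),
    i.e. every frequency is divided by t (position interpolation).
    """
    return -np.ones_like(z) / t

def scaled_frequencies(head_dim, t_target, n_steps=2000, dynamics=toy_dynamics):
    """Integrate the log-frequencies from scaling factor 1 to t_target with Euler steps."""
    z = np.log(rope_frequencies(head_dim))
    ts = np.linspace(1.0, t_target, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        z = z + (t1 - t0) * dynamics(z, t0)
    return np.exp(z)

def rope_angles(positions, freqs):
    """Rotation angle for each (position, frequency) pair."""
    return np.outer(positions, freqs)

freqs_1x = rope_frequencies(64)
freqs_4x = scaled_frequencies(64, t_target=4.0)

# Under these toy dynamics, position 16000 at t = 4 is rotated by roughly the
# same angles as position 4000 at t = 1.
angles_long = rope_angles(np.array([16000]), freqs_4x)
angles_short = rope_angles(np.array([4000]), freqs_1x)
```

Because the scaling factor is a continuous variable, the same routine yields frequencies for any intermediate value such as t = 2.5, not only for a fixed set of target lengths. Swapping `toy_dynamics` for a different function changes how each frequency dimension is scaled.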
Code & Models
- 🤗 DAMO-NLP-SG/CLEX-7B-Chat-16K · model · 25 dl · ♡ 3
- 🤗 DAMO-NLP-SG/CLEX-7B-16K · model · 22 dl · ♡ 3
- 🤗 DAMO-NLP-SG/CLEX-Mixtral-8x7B-32K · model · 11 dl · ♡ 3
- 🤗 DAMO-NLP-SG/CLEX-Mixtral-8x7B-Chat-32K · model · 16 dl · ♡ 1
- 🤗 DAMO-NLP-SG/CLEX-LLaMA-2-7B-64K · model · 72 dl · ♡ 3
- 🤗 DAMO-NLP-SG/CLEX-Phi-2-32K · model · 6 dl · ♡ 10
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Dropout · Layer Normalization
