DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search
Lei Yang, Shaoyang Xu, Jianxiang Peng, Shaolin Zhu, Deyi Xiong

TL;DR
This paper introduces DCIS, a divide-and-conquer algorithm for efficiently determining optimal scaling factors to extend LLMs' context length, reducing fine-tuning costs and improving performance at longer contexts.
Contribution
The paper proposes a novel DCIS algorithm that strategically searches for scaling factors, enabling effective context length extension with less fine-tuning and higher efficiency than existing methods.
Findings
DCIS doubles search efficiency compared to other methods.
The identified scaling factors improve performance at extended lengths.
Models can generalize to longer contexts without additional fine-tuning.
Abstract
Large language models (LLMs) based on the Transformer architecture usually have a limited context length because of the high training cost. Recent work extends the context window by adjusting the scaling factors of RoPE and then fine-tuning. However, suboptimal initialization of these factors increases fine-tuning cost and reduces performance at the target length. To address these challenges, we propose a novel RoPE-based fine-tuning framework that diverges from conventional scaling-factor search. Specifically, we present a Divide-and-Conquer Incremental Search (DCIS) algorithm that strategically determines better scaling factors. Further fine-tuning with the identified scaling factors effectively extends the context window of LLMs. Empirical results demonstrate that our methodology not only mitigates performance decay at extended…
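For readers unfamiliar with what a RoPE "scaling factor" is, the sketch below may help. It is not the DCIS algorithm, whose search procedure is not described on this page. It only illustrates the general idea of dividing each RoPE rotary frequency by a per-dimension factor so that longer positions map back into the angle range seen during training. The function names, the interleaved pairing of vector components, and the default base of 10000 are assumptions chosen for illustration.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def scaled_angles(position, head_dim, scaling_factors, base=10000.0):
    """Rotation angles at `position`, with each frequency divided by its
    scaling factor (hypothetical helper; a factor of 1.0 means no scaling)."""
    inv_freq = rope_inv_freq(head_dim, base)
    if len(scaling_factors) != len(inv_freq):
        raise ValueError("need one scaling factor per frequency")
    return [position * f / s for f, s in zip(inv_freq, scaling_factors)]

def apply_rope(vec, position, scaling_factors, base=10000.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by the scaled angles.
    The interleaved pairing is an assumption; some implementations pair
    the first half of the vector with the second half instead."""
    angles = scaled_angles(position, len(vec), scaling_factors, base)
    out = []
    for i, angle in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out

# A uniform factor of 2 makes position 8 look like unscaled position 4.
uniform = [2.0] * 4
identity = [1.0] * 4
print(scaled_angles(8, 8, uniform) == scaled_angles(4, 8, identity))
```

A uniform factor corresponds to plain position interpolation. Methods that search for scaling factors, as the abstract describes, instead choose a possibly different factor for each frequency.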
Taxonomy
Topics: Algorithms and Data Compression · Handwritten Text Recognition Techniques · Natural Language Processing Techniques
Methods: Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Attention Is All You Need · Dense Connections · Residual Connection · Multi-Head Attention · Adam
