LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

Liang Zhao; Tianwen Wei; Liang Zeng; Cheng Cheng; Liu Yang; Peng Cheng; Lijie Wang; Chenxia Li; Xuejie Wu; Bo Zhu; Yimeng Gan; Rui Hu; Shuicheng Yan; Han Fang; Yahui Zhou

arXiv:2406.00605 · cs.CL · June 4, 2024

TL;DR

LongSkywork introduces a training recipe that efficiently extends the context length of large language models to 200,000 tokens, using synthetic data and a dedicated long-context SFT stage.

Contribution

The paper presents a training approach that adds a long-context SFT stage after the standard SFT stage and introduces two methods for creating synthetic data, extending LLM context length with little additional training (about 200 iterations for the long-context stage).

Findings

01 Synthetic long-context SFT data can, to some extent, outperform human-curated data.

02 Models achieve perfect accuracy on long-context retrieval benchmarks.

03 LongSkywork-13B matches the performance of leading models such as Claude 2.1.

Abstract

We introduce LongSkywork, a long-context Large Language Model (LLM) capable of processing up to 200,000 tokens. We provide a training recipe for efficiently extending the context length of LLMs. We identify that the critical element in enhancing long-context processing capability is to incorporate a long-context Supervised Fine-Tuning (SFT) stage following the standard SFT stage. A mere 200 iterations can convert the standard SFT model into a long-context model. To reduce the effort in collecting and annotating data for long-context language modeling, we develop two novel methods for creating synthetic data. These methods are applied during the continual pretraining phase as well as the SFT phase, greatly enhancing the training efficiency of our long-context LLMs. Our findings suggest that synthetic long-context SFT data can surpass the performance of data curated by humans to some extent.…
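
The stage ordering described in the abstract can be sketched as a small data structure. This is only an illustrative outline, not the authors' training code: the `Stage` class, the stage names, and the helper functions are hypothetical, and placing continual pretraining before standard SFT is an assumption based on common practice. Only the 200-iteration figure and the fact that long-context SFT follows standard SFT come from the abstract; values the abstract does not state are left as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, for illustration only)."""
    name: str
    # True if the abstract says synthetic data is used here; None if unclear.
    uses_synthetic_data: Optional[bool] = None
    # Number of training iterations; None if the abstract does not state it.
    iterations: Optional[int] = None


# Assumed order: continual pretraining, then standard SFT, then long-context SFT.
RECIPE = (
    Stage("continual_pretraining", uses_synthetic_data=True),
    Stage("standard_sft"),
    Stage("long_context_sft", uses_synthetic_data=True, iterations=200),
)

MAX_CONTEXT_TOKENS = 200_000  # context length reported in the abstract


def stage_order(recipe: Sequence[Stage]) -> list:
    """Return the stage names in the order they run."""
    return [stage.name for stage in recipe]


def follows(recipe: Sequence[Stage], later: str, earlier: str) -> bool:
    """Check that stage `later` runs after stage `earlier`."""
    names = stage_order(recipe)
    return names.index(later) > names.index(earlier)


if __name__ == "__main__":
    for position, stage in enumerate(RECIPE, start=1):
        print(position, stage.name, stage.iterations)
```

Writing the recipe down this way makes the key claim easy to check: the long-context SFT stage is short and comes last, after the model has already been through standard SFT.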

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Shrink and Fine-Tune