Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models

Haoran Lian; Junmin Chen; Wei Huang; Yizhe Xiong; Wenping Hu; Guiguang Ding; Hui Chen; Jianwei Niu; Zijia Lin; Fuzheng Zhang; Di Zhang

arXiv:2412.07171·cs.CL·December 11, 2024

Open Access

TL;DR

This paper introduces HARPE, a single-stage continual pretraining method for large language models that enhances their ability to handle long contexts efficiently, outperforming multi-stage approaches.

Contribution

The paper proposes a novel single-stage pretraining technique, HARPE, which simplifies long context training for LLMs by using head-adaptive rotary position encoding.

Findings

01. HARPE outperforms multi-stage methods on 4 language modeling benchmarks.

02. HARPE effectively models long contexts with a single training stage.

03. The approach simplifies training while maintaining or improving performance.

Abstract

Recently, large language models (LLMs) have revolutionized Natural Language Processing (NLP). Pretrained LLMs, due to limited training context size, struggle with handling long token sequences, limiting their performance on various downstream tasks. Current solutions for long context modeling often employ multi-stage continual pretraining, which progressively increases the effective context length through several continual pretraining stages. However, those approaches require extensive manual tuning and human expertise. In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position Encoding (HARPE), to equip LLMs with long context modeling capabilities while simplifying the training process. Our HARPE leverages different Rotary Position Encoding (RoPE) base frequency values across different attention heads and directly trains LLMs on the…
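The idea of giving each attention head its own RoPE base can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how the per-head base values are chosen, so the geometric schedule in `per_head_bases`, its `base_min` and `base_max` defaults, and all function names here are hypothetical and serve only to make the example concrete. The rotation itself is the standard RoPE formula, applied with a different base for each head.

```python
import numpy as np

def per_head_bases(num_heads, base_min=1e4, base_max=1e6):
    # Hypothetical schedule (an assumption, not taken from the paper):
    # one RoPE base per head, geometrically spaced between two bounds.
    return np.geomspace(base_min, base_max, num_heads)

def rope_angles(positions, head_dim, base):
    # Standard RoPE: inverse frequency base^(-2i/d) for each channel pair i,
    # rotation angle = position * inverse frequency.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)  # shape (seq_len, head_dim // 2)

def apply_head_adaptive_rope(x, bases):
    # x has shape (num_heads, seq_len, head_dim) with an even head_dim.
    # Each head h is rotated using its own base, bases[h].
    num_heads, seq_len, head_dim = x.shape
    positions = np.arange(seq_len)
    out = np.empty_like(x)
    for h in range(num_heads):
        ang = rope_angles(positions, head_dim, bases[h])
        cos, sin = np.cos(ang), np.sin(ang)
        x_even, x_odd = x[h, :, 0::2], x[h, :, 1::2]
        # Rotate each (even, odd) channel pair by its position-dependent angle.
        out[h, :, 0::2] = x_even * cos - x_odd * sin
        out[h, :, 1::2] = x_even * sin + x_odd * cos
    return out

# Usage: 4 heads, 8 positions, head dimension 16.
x = np.random.default_rng(0).normal(size=(4, 8, 16))
bases = per_head_bases(4)
y = apply_head_adaptive_rope(x, bases)
```

In this sketch, heads with a larger base rotate more slowly with position, so different heads encode position at different scales within a single training run. Whether this matches HARPE's actual base assignment cannot be confirmed from the truncated abstract.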

Taxonomy

Topics: Topic Modeling

Methods: Softmax · Attention Is All You Need · Balanced Selection