NExtLong: Toward Effective Long-Context Training without Long Documents

Chaochen Gao; Xing Wu; Zijia Lin; Debing Zhang; Songlin Hu

arXiv:2501.12766·cs.CL·May 27, 2025

Open Access · 1 Repo · 3 Models · 5 Datasets

TL;DR

NExtLong introduces a novel data synthesis framework for training large language models to better understand long-range dependencies without needing actual long documents, improving performance on long-context benchmarks.

Contribution

NExtLong presents a new method for synthesizing long-context data using negative document extension, enhancing long-range dependency modeling in language models.

Findings

01. Significant performance improvements on the HELMET and RULER benchmarks.

02. Reduces reliance on non-synthetic long documents for training.

03. Enhances long-context understanding in LLMs.

Abstract

Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing…
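
The chunk-then-interleave procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the word-count chunking, the bag-of-words cosine retriever, and the choice to place distractors directly after each meta-chunk are all assumptions made for illustration. A real pipeline would presumably use a dense retriever over a large pretraining corpus.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def split_into_meta_chunks(document: str, chunk_words: int) -> list[str]:
    """Split a document into consecutive chunks of at most `chunk_words` words."""
    words = document.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a toy stand-in for a learned embedding (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_hard_negatives(chunk: str, corpus: list[str], k: int) -> list[str]:
    """Return the k corpus passages most similar to the chunk.

    Similar-but-unrelated passages act as "hard" distractors. The corpus is
    assumed not to contain the source document itself.
    """
    query = bag_of_words(chunk)
    ranked = sorted(corpus, key=lambda p: cosine(query, bag_of_words(p)), reverse=True)
    return ranked[:k]


def synthesize_long_context(
    document: str,
    corpus: list[str],
    chunk_words: int = 8,
    negatives_per_chunk: int = 2,
) -> list[str]:
    """Interleave each meta-chunk with its retrieved distractors (hypothetical layout)."""
    pieces: list[str] = []
    for chunk in split_into_meta_chunks(document, chunk_words):
        pieces.append(chunk)
        pieces.extend(retrieve_hard_negatives(chunk, corpus, negatives_per_chunk))
    return pieces
```

Joining the returned pieces yields a longer training sequence in which the original meta-chunks are separated by topically similar distractors, so predicting a later meta-chunk requires attending past the distractors to the earlier ones.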

Code & Models

Repositories

caskcsg/longcontext (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques