LongAlign: A Recipe for Long Context Alignment of Large Language Models

Yushi Bai; Xin Lv; Jiajie Zhang; Yuze He; Ji Qi; Lei Hou; Jie Tang; Yuxiao Dong; Juanzi Li

arXiv:2401.18058 · cs.CL · February 1, 2024 · 1 citation

Open Access · 1 Repo · 9 Models · 2 Datasets

TL;DR

LongAlign is a recipe for aligning large language models to follow instructions over long contexts, combining a new instruction dataset, efficient training strategies, and an evaluation benchmark.

Contribution

The paper introduces a recipe covering data construction, training techniques, and evaluation methods for long context alignment of large language models.

Findings

01 Outperforms existing methods by up to 30% on long context tasks.

02 Maintains proficiency in short, generic tasks.

03 Provides open-source code, data, and models.

Abstract

Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign -- a recipe of the instruction data, training, and evaluation for long context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure the data diversity, it covers a broad range of tasks from various long context sources. Second, we adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packing training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms…
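The training ideas named in the abstract (sorted batching, packing, and loss weighting) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the first-fit packing heuristic, and the "mean of per-sequence means" weighting are all assumptions chosen for illustration, and the abstract does not specify the exact formulas used.

```python
from typing import List, Sequence


def sorted_batches(lengths: Sequence[int], batch_size: int) -> List[List[int]]:
    """Group sequence indices into batches of similar length to reduce padding.
    (Hypothetical helper; illustrates the idea of sorted batching.)"""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def pack_first_fit(lengths: Sequence[int], max_len: int) -> List[List[int]]:
    """Pack sequences into bins of at most max_len tokens using first-fit.
    (Assumed heuristic; other packing strategies are possible.)"""
    packs: List[List[int]] = []
    room: List[int] = []  # remaining capacity of each pack
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sequence {idx} is longer than max_len")
        for p, free in enumerate(room):
            if n <= free:
                packs[p].append(idx)
                room[p] -= n
                break
        else:  # no existing pack had room, so open a new one
            packs.append([idx])
            room.append(max_len - n)
    return packs


def token_mean_loss(per_seq_losses: Sequence[Sequence[float]]) -> float:
    """Naive packed loss: average over all target tokens.
    Sequences with many target tokens dominate the result."""
    total = sum(sum(seq) for seq in per_seq_losses)
    count = sum(len(seq) for seq in per_seq_losses)
    return total / count


def sequence_balanced_loss(per_seq_losses: Sequence[Sequence[float]]) -> float:
    """Weighted loss: average each sequence's own mean token loss, so every
    sequence contributes equally. Assumes each sequence has >= 1 target token."""
    means = [sum(seq) / len(seq) for seq in per_seq_losses]
    return sum(means) / len(means)


if __name__ == "__main__":
    lengths = [9000, 300, 7000, 1200, 500]
    print(sorted_batches(lengths, batch_size=2))
    print(pack_first_fit(lengths, max_len=10000))

    # One sequence with four target tokens, one with a single target token.
    losses = [[1.0, 1.0, 1.0, 1.0], [3.0]]
    print(token_mean_loss(losses), sequence_balanced_loss(losses))
```

In the toy loss example, the token-level average is pulled toward the sequence with more target tokens, while the balanced version treats both sequences equally. That imbalance is the kind of effect a loss weighting scheme for packed training aims to correct.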

Code & Models

Repositories

thudm/longalign · PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings