Training Long-Context LLMs Efficiently via Chunk-wise Optimization

Wenhao Li; Yuxin Zhang; Gen Luo; Daohai Yu; Rongrong Ji

arXiv:2505.16710·cs.LG·May 23, 2025

TL;DR

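This paper introduces SeCO and SpaCO, two memory-efficient chunk-wise training methods for long-context LLMs. SeCO extends the trainable sequence length by keeping only one chunk's forward activations in memory at a time, and SpaCO speeds up training by propagating gradients to only selected chunks.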

Contribution

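The paper proposes two chunk-wise optimization techniques. SeCO partitions a long input into chunks and backpropagates through one chunk at a time, so that only one chunk's forward activations are stored in memory. SpaCO builds on SeCO by propagating gradients to only a subset of chunks, with a compensation factor that keeps the gradient estimate unbiased.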

Findings

01. SeCO increases maximum sequence length from 1K to 16K tokens.

02. SpaCO achieves up to 3x faster training speed than SeCO.

03. Both methods enable training long-context models on limited hardware.
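
To make the memory argument behind these findings concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: it assumes a one-parameter linear model with a per-token squared loss, and the names `full_gradient` and `chunked_gradient` are hypothetical. It deliberately ignores the cross-chunk dependencies (for example through attention over earlier tokens) that a real long-context method has to handle. It only shows that accumulating the gradient chunk by chunk gives the same result while the number of activations held at once is bounded by the chunk size.

```python
def full_gradient(w, xs, ys):
    """Gradient of sum_t 0.5 * (w * x_t - y_t)**2 w.r.t. w, all tokens at once."""
    activations = [w * x for x in xs]  # every activation is alive simultaneously
    grad = sum((a - y) * x for a, x, y in zip(activations, xs, ys))
    return grad, len(activations)


def chunked_gradient(w, xs, ys, chunk_size):
    """Same gradient, accumulated one chunk at a time (toy assumption:
    chunks do not depend on each other)."""
    grad = 0
    peak_live = 0
    for start in range(0, len(xs), chunk_size):
        chunk_x = xs[start:start + chunk_size]
        chunk_y = ys[start:start + chunk_size]
        activations = [w * x for x in chunk_x]  # only this chunk's activations
        peak_live = max(peak_live, len(activations))
        grad += sum((a - y) * x
                    for a, x, y in zip(activations, chunk_x, chunk_y))
        # The chunk's activations are dropped before the next chunk starts.
    return grad, peak_live


# Hypothetical toy data: 16 tokens processed in chunks of 4.
xs = list(range(1, 17))
ys = [2 * x + 1 for x in xs]
w = 3

grad_full, live_full = full_gradient(w, xs, ys)
grad_chunked, live_chunked = chunked_gradient(w, xs, ys, chunk_size=4)
```

In this toy setting both calls return the same gradient, while the peak number of live activations drops from the sequence length to the chunk size.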

Abstract

While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length,…
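
The abstract does not spell out SpaCO's compensation factor, so the following is only a generic illustration of how a compensation factor can make a sparse gradient estimate unbiased. It assumes, purely for illustration, that the full gradient is a sum of per-chunk contributions, that k of the n chunks are selected uniformly at random without replacement, and that the factor is n/k. The function names and the example values are hypothetical. The sketch enumerates every possible selection and checks that the average of the scaled estimates equals the full gradient exactly.

```python
from fractions import Fraction
from itertools import combinations


def sparse_estimate(chunk_grads, selected, compensation):
    """Sum the gradients of the selected chunks and rescale the result."""
    return compensation * sum(chunk_grads[i] for i in selected)


def expected_sparse_estimate(chunk_grads, k):
    """Exact expectation of the estimate over all equally likely k-subsets."""
    n = len(chunk_grads)
    # Assumed factor for uniform sampling without replacement; the paper's
    # actual compensation factor may be defined differently.
    compensation = Fraction(n, k)
    subsets = list(combinations(range(n), k))
    total = sum(sparse_estimate(chunk_grads, s, compensation) for s in subsets)
    return total / len(subsets)


# Hypothetical scalar per-chunk gradient contributions for 8 chunks.
chunk_grads = [3, -1, 4, 1, -5, 9, 2, -6]
full_grad = sum(chunk_grads)

# Backpropagating through only 2 of the 8 chunks per step.
expected = expected_sparse_estimate(chunk_grads, k=2)
```

Under these assumptions `expected` equals `full_grad`, while each individual step only pays the backpropagation cost of k chunks. A single step's estimate is still noisy; unbiasedness only says the noise averages out.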

Code & Models

Repositories

wenhaoli-xmu/seco (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Big Data and Digital Economy

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings