Training Long-Context LLMs Efficiently via Chunk-wise Optimization
Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji

TL;DR
This paper introduces SeCO and SpaCO, two memory-efficient chunk-wise training methods for long-context LLMs. SeCO keeps only one chunk's forward activations in memory, so longer sequences fit on the same hardware; SpaCO backpropagates through only a subset of chunks, which cuts training time.
Contribution
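The paper proposes two chunk-wise optimization techniques. SeCO splits a long input into chunks and backpropagates through each chunk locally, which lowers peak memory. SpaCO builds on SeCO by propagating gradients only to selected chunks, with a compensation factor that keeps the gradient estimate unbiased.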
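The chunk-wise idea behind SeCO can be illustrated with a toy analogue. The sketch below is not the paper's algorithm: it assumes a scalar recurrence `h_t = w * h_{t-1} + x_t` with loss `sum(0.5 * h_t**2)` as a stand-in for a real model, and the names `loss`, `full_grad`, and `chunked_grad` are hypothetical. It shows that backpropagating one chunk at a time, while keeping only the state at each chunk boundary, gives the same gradient as backpropagating through the whole sequence, with far fewer activations held at once.

```python
def loss(w, xs):
    """Toy loss: sum of 0.5 * h_t^2 for the recurrence h_t = w * h_{t-1} + x_t."""
    h, total = 0.0, 0.0
    for x in xs:
        h = w * h + x
        total += 0.5 * h * h
    return total


def full_grad(w, xs):
    """dL/dw by backpropagating through the whole sequence at once.

    Stores every activation h_0..h_T, so memory grows with sequence length.
    """
    hs = [0.0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    g_w, g_h = 0.0, 0.0  # g_h: gradient arriving from later time steps
    for t in range(len(xs), 0, -1):
        g = hs[t] + g_h          # local loss term plus gradient from the future
        g_w += g * hs[t - 1]     # contribution of step t to dL/dw
        g_h = g * w              # gradient passed back to h_{t-1}
    return g_w


def chunked_grad(w, xs, chunk_size):
    """Same gradient, but only one chunk's activations are alive at a time.

    Returns (gradient, peak number of activations stored for any one chunk).
    """
    # Pass 1: forward without keeping activations; remember only the state
    # at the start of each chunk (an assumed stand-in for carried context).
    boundaries, h = [0.0], 0.0
    for i, x in enumerate(xs, 1):
        h = w * h + x
        if i % chunk_size == 0 or i == len(xs):
            boundaries.append(h)

    # Pass 2: visit chunks from last to first, rebuild each chunk's
    # activations, backpropagate locally, and hand the boundary gradient on.
    starts = list(range(0, len(xs), chunk_size))
    g_w, g_h, peak = 0.0, 0.0, 0
    for c in reversed(range(len(starts))):
        chunk = xs[starts[c]:starts[c] + chunk_size]
        hs = [boundaries[c]]
        for x in chunk:
            hs.append(w * hs[-1] + x)
        peak = max(peak, len(hs))
        for t in range(len(chunk), 0, -1):
            g = hs[t] + g_h
            g_w += g * hs[t - 1]
            g_h = g * w
    return g_w, peak


xs = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 1.0, 0.1, -0.3, 0.9]
grad_full = full_grad(0.9, xs)
grad_chunked, peak_activations = chunked_grad(0.9, xs, chunk_size=4)
print(grad_full, grad_chunked, peak_activations)
```

In this toy, the extra cost of the chunked version is a second forward pass. Whether the paper orders its chunks the same way, or carries gradients between chunks like this, is not stated in this summary.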
Findings
SeCO increases maximum sequence length from 1K to 16K tokens.
SpaCO achieves up to 3x faster training speed than SeCO.
Both methods enable training long-context models on limited hardware.
Abstract
While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length,…
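To make the notion of a compensation factor concrete, here is a minimal sketch of a generic inverse-probability correction. It is an assumption for illustration, not the factor used in the paper, which this summary does not specify. The sketch treats the total gradient as a sum of independent per-chunk contributions (a simplification that ignores any gradient flow between chunks), selects `k` of `n` chunks uniformly without replacement, and scales the partial sum by `n / k`. The helper names `sparse_estimate` and `expected_estimate` are hypothetical.

```python
from itertools import combinations


def sparse_estimate(chunk_grads, selected):
    """Estimate the full gradient from a subset of chunk contributions.

    Each chunk is selected with probability k / n under uniform sampling
    without replacement, so scaling by n / k compensates for skipped chunks.
    """
    n, k = len(chunk_grads), len(selected)
    return (n / k) * sum(chunk_grads[i] for i in selected)


def expected_estimate(chunk_grads, k):
    """Average the estimator over every possible size-k subset of chunks."""
    subsets = list(combinations(range(len(chunk_grads)), k))
    return sum(sparse_estimate(chunk_grads, s) for s in subsets) / len(subsets)


# Hypothetical per-chunk gradient contributions (scalars for simplicity).
chunk_grads = [0.4, -1.2, 2.5, 0.3, -0.6, 1.1]
full = sum(chunk_grads)
print(full, expected_estimate(chunk_grads, k=2))
```

Averaged over all subsets, the scaled estimate matches the full sum, which is what "unbiased" means here. The cost of each step depends on `k` rather than on the number of chunks, at the price of higher variance per step.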
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Big Data and Digital Economy
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
