StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs
Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li

TL;DR
StreamBP is a memory-efficient, exact backpropagation method that significantly extends the maximum sequence length for training large language models, while reducing the memory cost of activations and logits and speeding up backpropagation.
Contribution
The paper presents StreamBP, which linearly decomposes the chain rule along the sequence dimension, layer by layer, to reduce the memory usage and computational cost of backpropagation in long-sequence training of language models.
Findings
Extends the maximum sequence length by 2.8-5.5 times compared to gradient checkpointing.
Achieves faster backpropagation at comparable or lower computational cost.
Supports multi-GPU training with a communication-efficient distributed implementation.
Abstract
Training language models on long sequence data is a demanding requirement for enhancing the model's capability on complex tasks, e.g., long-chain reasoning. However, as the sequence length scales up, the memory cost of storing activation values during the Backpropagation (BP) process becomes huge, even when gradient checkpointing is applied. To tackle this challenge, we propose a memory-efficient and exact BP method called StreamBP, which performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner, significantly reducing the memory cost of activation values and logits. The proposed method is applicable to common objectives such as SFT, GRPO, and DPO. From an implementation perspective, StreamBP requires fewer FLOPs and achieves faster BP by leveraging the causal structure of the language model. Compared to gradient…
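The sequence-wise decomposition mentioned in the abstract can be illustrated on the simplest piece of the model, the output head. The sketch below is not the authors' implementation. It assumes a summed cross-entropy loss over a single linear head, and every name in it (head_loss_and_grad, chunked_head_loss_and_grad, chunk_size) is hypothetical. Because the loss is a sum over positions, its gradient with respect to the head weights is a sum of per-chunk gradients, so the logits only ever need to be materialised one chunk at a time. The result matches the full computation up to floating-point rounding.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax of one logits row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def head_loss_and_grad(hidden, targets, weight):
    """Summed cross-entropy loss and dL/dW for the positions passed in.

    Materialises one logits row per position, so memory for logits
    grows with len(hidden) * vocab_size.
    """
    logits = [[sum(w * x for w, x in zip(row, h)) for row in weight] for h in hidden]
    probs = [softmax(r) for r in logits]
    loss = -sum(math.log(p[y]) for p, y in zip(probs, targets))
    grad = [[0.0] * len(weight[0]) for _ in weight]
    for p, h, y in zip(probs, hidden, targets):
        for v, grad_row in enumerate(grad):
            coeff = p[v] - (1.0 if v == y else 0.0)  # d loss / d logit_v
            for d in range(len(grad_row)):
                grad_row[d] += coeff * h[d]
    return loss, grad


def chunked_head_loss_and_grad(hidden, targets, weight, chunk_size):
    """Same loss and gradient, accumulated chunk by chunk along the sequence.

    Only chunk_size logits rows exist at any time (hypothetical helper,
    illustrating the linearity of the gradient in the per-position terms).
    """
    total_loss = 0.0
    total_grad = [[0.0] * len(weight[0]) for _ in weight]
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        loss, grad = head_loss_and_grad(hidden[start:stop], targets[start:stop], weight)
        total_loss += loss
        for v, grad_row in enumerate(grad):
            for d, g in enumerate(grad_row):
                total_grad[v][d] += g
    return total_loss, total_grad


# Toy sizes, chosen only for illustration.
random.seed(0)
SEQ_LEN, DIM, VOCAB = 8, 4, 6
hidden = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(SEQ_LEN)]
targets = [random.randrange(VOCAB) for _ in range(SEQ_LEN)]
weight = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]

full_loss, full_grad = head_loss_and_grad(hidden, targets, weight)
chunk_loss, chunk_grad = chunked_head_loss_and_grad(hidden, targets, weight, chunk_size=3)

max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(full_grad, chunk_grad)
    for a, b in zip(row_a, row_b)
)
print("loss difference:", abs(full_loss - chunk_loss))
print("max gradient difference:", max_diff)
```

This toy covers only the output head. According to the abstract, the method applies the same kind of decomposition layer by layer inside the model and exploits its causal structure, which this sketch does not attempt to reproduce.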
Taxonomy
Topics: Software Testing and Debugging Techniques · Machine Learning and Algorithms
