Efficient Long-context Language Model Training by Core Attention Disaggregation
Yonghao Zhuang, Junda Chen, Bo Pang, Yi Gu, Yibo Zhu, Yimin Jiang, Ion Stoica, Eric Xing, Hao Zhang

TL;DR
This paper introduces core attention disaggregation (CAD), a technique that speeds up long-context language model training by running core attention on a separate pool of devices and balancing it across them, which raises training throughput and removes stragglers.
Contribution
The paper proposes CAD, a novel method to decouple and distribute core attention computation, enabling efficient training of long-context models on large GPU clusters.
Findings
Up to 1.35x increase in training throughput on 512 GPUs.
Elimination of data-parallel and pipeline-parallel stragglers.
Near-perfect compute and memory balance achieved.
Abstract
We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level…
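The imbalance argument in the abstract can be made concrete with a small cost model. The sketch below is illustrative only: it assumes causal masking (query token i scores keys 0..i), counts query-key pairs as a stand-in for core attention compute, ignores communication cost, and uses a hypothetical greedy scheduler (`greedy_balance`) that is not taken from the paper. It shows two devices holding the same number of tokens but very different attention work, and how cutting the work into query-token shards lets a scheduler even it out.

```python
def causal_pairs(start, end):
    """Query-key pairs scored for query tokens [start, end) of one document,
    assuming causal masking: query i attends to keys 0..i."""
    return sum(i + 1 for i in range(start, end))


def greedy_balance(task_costs, num_devices):
    """Hypothetical scheduler: hand each task, largest first, to the device
    with the least work so far. Returns the per-device load."""
    loads = [0] * num_devices
    for cost in sorted(task_costs, reverse=True):
        loads[loads.index(min(loads))] += cost
    return loads


# Two devices with the same token budget (8192 tokens each).
long_doc = 8192            # device A: one long document
short_docs = [1024] * 8    # device B: eight short documents

# Colocated: each device runs core attention for its own documents.
colocated = [
    causal_pairs(0, long_doc),
    sum(causal_pairs(0, n) for n in short_docs),
]

# Disaggregated: core attention has no parameters, so each block of query
# tokens is an independent task. Shard the long document into 1024-token
# blocks and schedule all tasks over a 2-device attention pool.
shard = 1024
long_shards = [causal_pairs(s, s + shard) for s in range(0, long_doc, shard)]
short_tasks = [causal_pairs(0, n) for n in short_docs]
disaggregated = greedy_balance(long_shards + short_tasks, num_devices=2)

# Sharding does not change the total amount of work.
assert sum(long_shards) == causal_pairs(0, long_doc)

print("colocated imbalance:    ", max(colocated) / min(colocated))
print("disaggregated imbalance:", max(disaggregated) / min(disaggregated))
```

In this toy setting the colocated devices differ by roughly a factor of eight in attention work despite holding equal token counts, while the sharded tasks can be packed almost evenly. A real scheduler would also have to account for data movement and kernel efficiency, which this sketch leaves out.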
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Natural Language Processing Techniques
