Efficient Long-context Language Model Training by Core Attention Disaggregation

Yonghao Zhuang; Junda Chen; Bo Pang; Yi Gu; Yibo Zhu; Yimin Jiang; Ion Stoica; Eric Xing; Hao Zhang

arXiv:2510.18121·cs.LG·October 22, 2025

Open Access

TL;DR

This paper introduces core attention disaggregation (CAD), a technique that speeds up long-context language model training by running core attention on a separate pool of devices and balancing it there, raising training throughput by up to 1.35x on 512 GPUs and eliminating data and pipeline parallel stragglers.

Contribution

The paper proposes CAD, a novel method to decouple and distribute core attention computation, enabling efficient training of long-context models on large GPU clusters.

Findings

01. Up to 1.35x increase in training throughput on 512 GPUs.

02. Elimination of data and pipeline parallel stragglers.

03. Near-perfect compute and memory balance achieved.

Abstract

We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its quadratic compute growth compared to the near-linear growth of other components causes load imbalance and stragglers across data and pipeline parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards with arbitrary lengths. CAD partitions core attention into token-level…
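The two observations in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names (`core_attention`, `causal_pairs`, `make_vectors`), the toy sizes, the causal mask, and the omission of the usual 1/sqrt(d) scaling (the abstract writes the operation as softmax(QK^T)V) are all assumptions made for illustration. The sketch checks that core attention over a document can be computed on separate query shards and concatenated, and counts query-key pairs to show why equal token counts do not imply equal attention work.

```python
import math

def make_vectors(n, d, seed):
    # Deterministic toy data so the example needs no random state.
    return [[math.sin(seed + i * d + j) for j in range(d)] for i in range(n)]

def core_attention(q_rows, k, v, q_offset=0):
    """Causal softmax(QK^T)V for a shard of query rows (hypothetical helper).

    q_rows holds queries whose positions in the document start at q_offset;
    k and v hold keys and values for the whole document. No parameters or
    persistent state are involved, only the inputs passed in.
    """
    out = []
    for i, q in enumerate(q_rows):
        visible = q_offset + i + 1  # query at position p attends to keys 0..p
        scores = [sum(a * b for a, b in zip(q, k[j])) for j in range(visible)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights)) / total
            for c in range(len(v[0]))
        ])
    return out

def causal_pairs(start, end):
    # Number of (query, key) pairs for queries at positions start..end-1.
    return (end * (end + 1) - start * (start + 1)) // 2

n, d = 8, 4
q = make_vectors(n, d, 0.1)
k = make_vectors(n, d, 0.2)
v = make_vectors(n, d, 0.3)

# Whole document at once versus two query shards computed independently.
full = core_attention(q, k, v)
sharded = (core_attention(q[:3], k, v, q_offset=0)
           + core_attention(q[3:], k, v, q_offset=3))
shards_match = all(
    math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
    for row_a, row_b in zip(full, sharded)
    for a, b in zip(row_a, row_b)
)

# Same token count per device, different attention work.
work_long = causal_pairs(0, 8)       # one 8-token document
work_short = 4 * causal_pairs(0, 2)  # four 2-token documents
```

In this toy setting the device holding the single 8-token document performs 36 query-key pair computations while the device holding four 2-token documents performs 12, a 3x gap at identical token counts. Splitting the long document into query shards [0, 3) and [3, 8) yields 6 and 30 pairs that could be placed on different devices; how CAD actually sizes and schedules its shards is not specified in the text above.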

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Natural Language Processing Techniques