ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention

Yuhong Chou; Zehao Liu; Ruijie Zhu; Xinyi Wan; Tianjian Li; Congying Chu; Qian Liu; Jibin Wu; Zejun Ma

arXiv:2507.01004·cs.LG·July 3, 2025

TL;DR

ZeCO introduces a novel sequence parallelism method with zero communication overhead for linear attention models, enabling efficient training of ultra-long sequences on large-scale hardware with near-linear scalability.

Contribution

ZeCO presents a new sequence parallelism (SP) method that uses All-Scan, a new collective communication primitive, to eliminate communication overhead and achieve near-linear scalability for long-sequence training of linear attention models.

Findings

01. ZeCO achieves a 60% speedup over the state-of-the-art (SOTA) SP method on 256 GPUs at an 8M sequence length.

02. All-Scan keeps communication to a minimum, enabling efficient sequence parallelism.

03. ZeCO trains a model with a 1M sequence length across 64 devices in roughly the same time as a 16k sequence on a single device.

Abstract

Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, which are essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long-sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with a 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan…
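To make the setting concrete, the sketch below illustrates why linear attention lends itself to sequence parallelism: the sequence can be cut into chunks, and the only thing a chunk needs from its predecessors is a fixed-size state matrix whose shape does not depend on sequence length. This is not ZeCO or All-Scan. It is a single-process illustration under simplifying assumptions (unnormalized causal linear attention, no gating or decay, one head), and every function name in it is hypothetical. Each chunk stands in for one device, and the `state` handed from chunk to chunk stands in for what an SP method would have to communicate.

```python
# Illustrative only: plain causal linear attention, o_t = q_t @ sum_{s<=t} k_s^T v_s.
# Chunks are processed sequentially here; a real SP method runs them on separate devices.

def outer(k, v):
    """Outer product k^T v as a (d_k x d_v) nested list."""
    return [[ki * vj for vj in v] for ki in k]

def add(A, B):
    """Element-wise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def vecmat(q, S):
    """Row vector q (d_k) times matrix S (d_k x d_v)."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]

def linear_attention_reference(Q, K, V):
    """Quadratic-time reference: o_t = sum_{s<=t} (q_t . k_s) v_s."""
    out = []
    for t, q in enumerate(Q):
        o = [0.0] * len(V[0])
        for s in range(t + 1):
            w = sum(a * b for a, b in zip(q, K[s]))
            o = [oi + w * vi for oi, vi in zip(o, V[s])]
        out.append(o)
    return out

def chunk_forward(Q, K, V, state):
    """Process one chunk given the incoming state; return outputs and outgoing state."""
    out = []
    for q, k, v in zip(Q, K, V):
        state = add(state, outer(k, v))   # fold this token into the running state
        out.append(vecmat(q, state))      # read out with the query
    return out, state

def linear_attention_chunked(Q, K, V, num_chunks):
    """Split the sequence into equal chunks and pass only the state between them.

    Assumes len(Q) is divisible by num_chunks.
    """
    d_k, d_v = len(K[0]), len(V[0])
    size = len(Q) // num_chunks
    state = [[0.0] * d_v for _ in range(d_k)]  # d_k x d_v, independent of sequence length
    out = []
    for c in range(num_chunks):
        sl = slice(c * size, (c + 1) * size)
        o, state = chunk_forward(Q[sl], K[sl], V[sl], state)
        out.extend(o)
    return out
```

Calling `linear_attention_chunked(Q, K, V, num_chunks)` with any chunk count that divides the sequence length should match `linear_attention_reference(Q, K, V)` up to floating-point rounding. The sequential hand-off of `state` in the loop is the dependency that a sequence parallelism method has to resolve across devices.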


Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Big Data and Digital Economy