LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid

Weigao Sun; Disen Lan; Yiran Zhong; Xiaoye Qu; Yu Cheng

arXiv:2502.07563 · cs.LG · February 12, 2025

TL;DR

LASP-2 introduces a novel sequence parallelism method that optimizes communication and computation for linear attention transformers, enabling efficient training on very long sequences with significant speed improvements.

Contribution

LASP-2 rethinks the minimal communication requirement for sequence parallelism on linear attention layers and reorganizes the communication-computation workflow of LASP to improve scalability and efficiency in distributed training.

Findings

01 Achieves a 15.2% training speedup over LASP.

02 Trains 36.6% faster than Ring Attention.

03 Supports training on sequences up to 2048K in length across 64 GPUs.

Abstract

Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very long input sequences. Compared to the previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only one single AllGather collective communication…
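To make the "right-product-first" idea in the abstract concrete, here is a minimal single-process NumPy sketch. It is not the authors' implementation. The function names are hypothetical, normalization and decay terms are omitted, and a plain Python list stands in for the AllGather collective. Because the abstract is truncated, the choice to gather per-chunk memory states `K_c^T V_c` is an assumption made here for illustration.

```python
import numpy as np


def causal_linear_attention_left(q, k, v):
    """Left-product-first: form the (n, n) score matrix, mask it, then apply it to v."""
    scores = np.tril(q @ k.T)  # causal mask: position i only sees positions j <= i
    return scores @ v


def causal_linear_attention_chunked(q, k, v, num_chunks):
    """Right-product-first over sequence chunks, one chunk per hypothetical device."""
    qs, ks, vs = (np.array_split(x, num_chunks) for x in (q, k, v))

    # Each "device" computes a local memory state of shape (d, d_v).
    # Its size does not depend on the chunk length.
    local_states = [kc.T @ vc for kc, vc in zip(ks, vs)]

    # Stand-in for one AllGather (assumption): every device sees every local state.
    gathered = local_states

    outputs = []
    for c, (qc, kc, vc) in enumerate(zip(qs, ks, vs)):
        # Sum the states of all earlier chunks (zero for the first chunk).
        prev_state = sum(gathered[:c], np.zeros_like(gathered[0]))
        intra = np.tril(qc @ kc.T) @ vc  # causal attention inside the chunk
        inter = qc @ prev_state          # contribution of all earlier chunks
        outputs.append(intra + inter)
    return np.concatenate(outputs)


rng = np.random.default_rng(0)
q = rng.standard_normal((16, 4))
k = rng.standard_normal((16, 4))
v = rng.standard_normal((16, 3))

reference = causal_linear_attention_left(q, k, v)
chunked = causal_linear_attention_chunked(q, k, v, num_chunks=4)
assert np.allclose(reference, chunked)
```

Both functions return the same output. The chunked version never builds the full score matrix across chunks, and the only data shared between chunks is one small state per chunk, which is why a single gather step suffices in this sketch.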

Code & Models

Repositories

opensparsellms/linear-moe (PyTorch, official)

Taxonomy

Topics: Neural Networks and Applications · Parallel Computing and Optimization Techniques · Anomaly Detection Techniques and Applications

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings