Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference
Vasu Shyam, Anna Golubeva, Quentin Anthony

TL;DR
The paper introduces tensor and sequence parallelism (TSP), a unified parallel execution strategy that reduces memory usage in transformer training by combining weight and token sharding on a single device axis.
Contribution
TSP is a novel parallelism scheme that unifies tensor and sequence parallelism, reducing memory overhead and enabling efficient training of long-context models.
Findings
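TSP reduces both per-device parameter memory and per-device activation memory along a single device axis, whereas TP alone shards only weights and SP alone shards only tokens.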
TSP demonstrates competitive performance in benchmarks.
Theoretical analysis shows communication trade-offs of TSP.
Abstract
We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP) shards model weights while sequence parallelism (SP) shards tokens, reducing per-device parameter or activation memory, respectively. Traditionally, each scheme is assigned its own mesh dimension. TSP instead assigns each rank both a weight shard and a sequence shard, reducing both parameter and activation memory along the same device axis. We implement this design with two runtime schedules. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations…
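The gated-MLP schedule described in the abstract can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a SwiGLU-style gate with SiLU activation, shards the hidden dimension evenly across ranks, and models the ring as a list rotation rather than real collective communication. The names `ring_gated_mlp`, `gated_mlp`, and `demo` are hypothetical.

```python
import numpy as np


def silu(x):
    # SiLU activation; the choice of gate nonlinearity is an assumption.
    return x / (1.0 + np.exp(-x))


def gated_mlp(x, w_gate, w_up, w_down):
    # Dense gated MLP: (silu(x W_gate) * (x W_up)) W_down.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def ring_gated_mlp(x_shards, weight_shards):
    """Simulate ranks that each hold a sequence shard and a weight shard.

    x_shards[r]      : tokens owned by rank r, shape (tokens_r, d_model)
    weight_shards[r] : (w_gate, w_up, w_down) slices of the hidden dimension
    """
    p = len(x_shards)
    d_model = weight_shards[0][2].shape[1]
    # Each rank accumulates the output for its own tokens only.
    outputs = [np.zeros((x.shape[0], d_model)) for x in x_shards]
    held = list(weight_shards)  # held[r] is the shard currently on rank r
    for _ in range(p):
        for r in range(p):
            w_gate, w_up, w_down = held[r]
            # Partial output from this hidden-dimension slice, added locally.
            outputs[r] += gated_mlp(x_shards[r], w_gate, w_up, w_down)
        # Ring step: rank r receives the shard previously held by rank r - 1.
        held = [held[(r - 1) % p] for r in range(p)]
    return outputs


def demo(p=4, seq=8, d_model=6, d_ff=12, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((seq, d_model))
    w_gate = rng.standard_normal((d_model, d_ff))
    w_up = rng.standard_normal((d_model, d_ff))
    w_down = rng.standard_normal((d_ff, d_model))

    # Tokens are split along the sequence axis, weights along the hidden axis.
    x_shards = np.split(x, p, axis=0)
    weight_shards = list(zip(
        np.split(w_gate, p, axis=1),
        np.split(w_up, p, axis=1),
        np.split(w_down, p, axis=0),
    ))

    y_ring = np.concatenate(ring_gated_mlp(x_shards, weight_shards), axis=0)
    y_ref = gated_mlp(x, w_gate, w_up, w_down)
    return y_ring, y_ref


if __name__ == "__main__":
    y_ring, y_ref = demo()
    print(np.allclose(y_ring, y_ref))
```

The sketch works because the gate is elementwise over the hidden dimension, so the dense output equals the sum of the partial outputs from each hidden-dimension slice. After one full rotation every rank has applied every weight shard to its own tokens, without ever holding the full weights or the full sequence.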
