Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

Yilong Zhao; Xiaonan Nie; Kan Zhu; Shuang Ma; Zhichao Lai; Hongxiang Hao; Yang Zhou; Baris Kasikci; Ion Stoica

arXiv:2605.08524·cs.DC·May 12, 2026

TL;DR

This paper introduces FCP, a flexible context parallelism method that improves scalability and efficiency in foundation model pretraining by better handling sequence length variation through block-level sharding and peer-to-peer communication.

Contribution

FCP enables arbitrary peer-to-peer communication and block-level sharding, achieving high efficiency and workload balance in large-scale model pretraining.

Findings

01. FCP attains near-linear scalability on up to 256 GPUs.

02. FCP improves attention MFU by 1.13x to 2.21x.

03. FCP effectively handles sequence length variation with bin-packing.

Abstract

Context parallelism (CP) has been widely adopted to support the growing context length in foundation model pretraining. However, existing designs fail to handle the large variation in sequence length from training datasets, resulting in suboptimal performance. These methods often over-shard short sequences, leading to compute inefficiency and excessive communication, or process long and short sequences separately without proper bin-packing, causing workload imbalance. In this paper, we propose FCP, a flexible context parallelism paradigm that shards and schedules sequences at block-level granularity. Instead of relying on rigid communication topologies such as ring, FCP enables arbitrary peer-to-peer communication, allowing flexible placement of sequence blocks across workers. By bin-packing blocks from both short and long sequences, FCP achieves both high compute efficiency and…
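To make the idea of block-level sharding plus bin-packing concrete, here is a minimal sketch in Python. It is not FCP's actual scheduler, which the abstract does not specify. The function names (`shard_into_blocks`, `pack_blocks`), the cost model (a query block at index b in a causal sequence attends to b + 1 key blocks), and the greedy heaviest-first heuristic are all assumptions chosen for illustration.

```python
import heapq
from typing import List, Tuple

Block = Tuple[int, int, int]  # (sequence id, block index, assumed cost)


def shard_into_blocks(seq_lens: List[int], block_size: int) -> List[Block]:
    """Split every sequence into fixed-size blocks (hypothetical helper)."""
    blocks: List[Block] = []
    for seq_id, length in enumerate(seq_lens):
        num_blocks = -(-length // block_size)  # ceiling division
        for b in range(num_blocks):
            # Assumed cost model: with causal attention, query block b
            # attends to key blocks 0..b, so its cost is taken as b + 1.
            blocks.append((seq_id, b, b + 1))
    return blocks


def pack_blocks(blocks: List[Block], num_workers: int):
    """Greedy bin-packing: place the heaviest block on the least-loaded worker.

    This is the classic longest-processing-time heuristic, used here as a
    stand-in for whatever packing strategy the paper actually uses.
    """
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    placement: List[List[Tuple[int, int]]] = [[] for _ in range(num_workers)]
    for seq_id, b, cost in sorted(blocks, key=lambda blk: -blk[2]):
        load, w = heapq.heappop(heap)
        placement[w].append((seq_id, b))
        heapq.heappush(heap, (load + cost, w))
    loads = [0] * num_workers
    for load, w in heap:
        loads[w] = load
    return placement, loads


# One long sequence mixed with three short ones, packed onto 4 workers.
example_blocks = shard_into_blocks([8192, 1024, 1024, 512], block_size=1024)
example_placement, example_loads = pack_blocks(example_blocks, num_workers=4)
print(example_loads)
```

In this sketch, blocks of the long sequence and the short sequences share workers freely, so no worker is forced to hold a fixed slice of every sequence. A real system would also have to account for the key/value blocks each worker must fetch from its peers, which this toy cost model ignores.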
