A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability
Ruitao Liu (1), Xinyang Tian (1), Shuo Chen (1), Tingrui Zhang (1), Guang Yang (1), Alan Zhao (2), Wei Xu (1) ((1) Tsinghua University, (2) Scitix AI)

TL;DR
The paper introduces RRFP, a runtime system that dynamically manages pipeline-parallel training schedules to better handle runtime variability, reporting speedups of up to 2.77× on large-scale GPU workloads.
Contribution
It proposes a readiness-driven runtime that treats a schedule as a non-binding hint order for ranking currently ready work, so stages do not wait on not-yet-ready tasks while other executable work is available.
Findings
RRFP achieves up to 1.77× speedup on language workloads.
RRFP achieves up to 2.77× speedup on multimodal workloads.
RRFP outperforms existing systems by up to 1.84× in cross-framework tests.
Abstract
Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven…
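The difference between consuming a schedule as a binding sequence and as a hint order can be illustrated with a toy single-stage simulation. This is a minimal sketch, not RRFP's implementation: the function names, task labels, readiness times, and durations are all hypothetical, and the model ignores inter-task dependencies and communication by treating readiness as a fixed arrival time per task.

```python
from typing import Dict, List, Tuple


def strict_order_makespan(hint_order: List[str],
                          ready_at: Dict[str, float],
                          duration: Dict[str, float]) -> float:
    """Binding schedule: run tasks in the given order, waiting for each one."""
    clock = 0.0
    for task in hint_order:
        # The stage idles until this specific task becomes ready.
        clock = max(clock, ready_at[task]) + duration[task]
    return clock


def readiness_first_makespan(hint_order: List[str],
                             ready_at: Dict[str, float],
                             duration: Dict[str, float]) -> Tuple[float, List[str]]:
    """Hint schedule: among tasks that are ready now, run the best-ranked one."""
    rank = {task: i for i, task in enumerate(hint_order)}
    pending = set(hint_order)
    clock = 0.0
    executed: List[str] = []
    while pending:
        ready = [t for t in pending if ready_at[t] <= clock]
        if not ready:
            # Nothing is executable: advance to the next arrival.
            clock = min(ready_at[t] for t in pending)
            continue
        task = min(ready, key=rank.__getitem__)  # hint order only breaks ties
        clock += duration[task]
        pending.remove(task)
        executed.append(task)
    return clock, executed


# Hypothetical stage: F1 arrives late (t=5), everything else is ready early.
hint_order = ["F0", "F1", "B0", "F2"]
ready_at = {"F0": 0.0, "F1": 5.0, "B0": 1.0, "F2": 1.0}
duration = {"F0": 1.0, "F1": 1.0, "B0": 1.0, "F2": 1.0}

strict = strict_order_makespan(hint_order, ready_at, duration)
relaxed, order = readiness_first_makespan(hint_order, ready_at, duration)
print(strict, relaxed, order)
```

In this toy case the strict consumer idles from t=1 to t=5 waiting for F1, while the readiness-first consumer fills that gap with B0 and F2 and runs F1 when it arrives. The hint order still matters: it decides which of several ready tasks runs first.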
