Laminar: A Probe-First Scheduling Paradigm with Deterministic Runtime Survival
Zhengyan Chu

TL;DR
Laminar is a decentralized scheduling framework for exascale GPU clusters that improves runtime survival and workload management through probabilistic flow splitting and a node-local survival layer.
Contribution
The paper introduces Laminar, a probe-first, execute-later scheduling paradigm, together with Airlock, a node-local layer for runtime survival, to support lifecycle-aware scheduling at scale.
Findings
Enables near O(1) control-plane work complexity.
Provides bounded, priority-ordered runtime survival.
Improves preservation of long-resident workloads under pressure.
Abstract
In exascale-oriented GPU clusters, rigid-topology jobs leave behind a fragmented post-landing ecology in which long-resident workloads and highly transient tasks compete for unstable residual capacity. Existing centralized, hierarchical, and local-first decentralized schedulers incur growing coordination and retry-amplification costs in this regime and typically stop their explicit responsibility at execution start, leaving runtime survival to indiscriminate host-level OOM heuristics. We present Laminar, a decentralized probe-first, execute-later scheduling paradigm that keeps hot-path control-plane work near O(1) through Zone-level probabilistic flow splitting, bounded in-Zone probing by persistent lightweight agents, and node-local arbitration. Laminar further introduces Airlock, a bounded node-local runtime-survival layer that converts severe memory pressure into an…
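The three hot-path steps named in the abstract (Zone-level probabilistic flow splitting, bounded in-Zone probing, node-local arbitration) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Node`, `Zone`, and `schedule` names, the per-Zone `weight`, the memory-only fit check, and the `max_probes` parameter are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mem_gb: float  # hypothetical: memory is the only resource modeled

    def probe(self, need_gb: float) -> bool:
        """Cheap read-only check: could this node host the task right now?"""
        return self.free_mem_gb >= need_gb

    def admit(self, need_gb: float) -> bool:
        """Node-local arbitration: re-check and reserve in one step."""
        if self.free_mem_gb >= need_gb:
            self.free_mem_gb -= need_gb
            return True
        return False


@dataclass
class Zone:
    name: str
    weight: float  # hypothetical: share of incoming tasks routed to this Zone
    nodes: list


def pick_zone(zones, rng):
    """Probabilistic flow splitting: weighted random choice of a Zone."""
    return rng.choices(zones, weights=[z.weight for z in zones], k=1)[0]


def schedule(zones, need_gb, rng, max_probes=2):
    """Probe first, execute later: probe at most max_probes nodes in one Zone.

    Returns the name of the node that admitted the task, or None.
    """
    zone = pick_zone(zones, rng)
    k = min(max_probes, len(zone.nodes))
    for node in rng.sample(zone.nodes, k):
        # The task is placed only after a successful probe and admission.
        if node.probe(need_gb) and node.admit(need_gb):
            return node.name
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    zone = Zone("zone-a", weight=1.0, nodes=[Node("n1", 2.0), Node("n2", 8.0)])
    print(schedule([zone], need_gb=4.0, rng=rng))
```

In this sketch the per-task work is one weighted draw plus at most `max_probes` probes, independent of cluster size, which is one way to read the "near O(1)" claim; how Laminar actually sets Zone weights and probe bounds is not stated in the text above.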
