On Harnessing Idle Compute at the Edge for Foundation Model Training
Leyang Xue, Meghana Madhyastha, Myungjin Lee, Amos Storkey, Randal Burns, Mahesh K. Marina

TL;DR
Cleave is a novel edge-training system that leverages the asymmetric I/O pattern of GEMMs to enable scalable, efficient, and fault-tolerant foundation model training on heterogeneous edge devices, rivaling cloud performance.
Contribution
The paper introduces Cleave, a new edge training architecture that exploits I/O asymmetry and a parameter-server design to improve scalability, efficiency, and fault tolerance.
Findings
Cleave achieves cloud-comparable GPU training performance.
Cleave outperforms state-of-the-art edge-training methods by 4-10x in per-batch runtime.
Cleave scales to thousands of heterogeneous devices and recovers from failures at least 100x faster.
Abstract
The foundation-model ecosystem remains highly centralized because training requires immense compute resources and is therefore largely limited to large cloud operators. Edge-assisted foundation model training that harnesses spare compute on edge devices offers a more democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-training performance, scale to larger models, fit within device memory limits, or keep communication overhead manageable. They also do not handle device heterogeneity and churn satisfactorily. We introduce Cleave, built on a structural insight: each GEMM has an asymmetric I/O pattern, in which its input matrices, sent over the downlink, are much larger than the partial output blocks returned over the uplink. This matches edge networks, where downlink bandwidth exceeds uplink by 2-10x. Exploiting this alignment with a…
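The I/O asymmetry described in the abstract can be illustrated with a little arithmetic. The sketch below is not Cleave's actual partitioning scheme, which the abstract does not specify. It assumes a simple tiling in which one device computes an m x n block of C = A @ B, downloads an m x k row panel of A and a k x n column panel of B, and uploads only the m x n result. The function name `gemm_tile_io` and the 2-byte element size are illustrative assumptions.

```python
def gemm_tile_io(m: int, n: int, k: int, bytes_per_elem: int = 2) -> dict:
    """Estimate per-device I/O for one m x n output block of C = A @ B.

    Assumed (hypothetical) tiling: the device downloads an m x k panel of A
    and a k x n panel of B, then uploads the m x n output block.
    """
    download_bytes = (m * k + k * n) * bytes_per_elem  # inputs over the downlink
    upload_bytes = m * n * bytes_per_elem              # output over the uplink
    return {
        "download_bytes": download_bytes,
        "upload_bytes": upload_bytes,
        # Equals k * (m + n) / (m * n): grows with the inner dimension k.
        "ratio": download_bytes / upload_bytes,
    }


if __name__ == "__main__":
    # Example: a 256 x 256 output block with inner dimension 4096.
    io = gemm_tile_io(m=256, n=256, k=4096)
    print(io["download_bytes"], io["upload_bytes"], io["ratio"])
```

Under these assumptions the download-to-upload ratio is k(m + n)/(mn), so a large inner dimension relative to the block size makes the downlink carry most of the traffic, which is the direction in which edge links have more bandwidth.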
