Propius: A Platform for Collaborative Machine Learning across the Edge and the Cloud
Eric Ding

TL;DR
Propius is a scalable platform designed for efficient resource management and collaboration in distributed machine learning across edge and cloud environments, addressing heterogeneity and scalability challenges.
Contribution
Propius introduces a novel system architecture, split into a control plane and a data plane, to improve resource sharing and scalability in collaborative ML.
Findings
Propius achieves up to 1.88x better resource utilization.
It improves throughput by up to 2.76x.
It reduces job completion time by up to 1.26x.
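One way to read the job completion time figure, assuming it is a ratio of baseline time to Propius time (the summary does not state this explicitly), is to convert it into the fraction of time saved. The helper name `fraction_saved` below is hypothetical and used only for illustration:

```python
def fraction_saved(speedup: float) -> float:
    """Fraction of baseline time saved, assuming the baseline takes
    `speedup` times as long as the improved system (an assumption about
    how the reported ratio is defined)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Under this reading, a 1.26x figure means roughly a fifth less time.
print(f"{fraction_saved(1.26):.1%}")  # prints 20.6%
```

Because each figure is reported as "up to", it is an upper bound across the evaluated settings rather than an average.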
Abstract
Collaborative Machine Learning is a paradigm in the field of distributed machine learning, designed to address the challenges of data privacy, communication overhead, and model heterogeneity. There have been significant advancements in optimization and communication algorithm design and in ML hardware that enable fair, efficient, and secure collaborative ML training. However, less emphasis has been placed on collaborative ML infrastructure development. Developers and researchers often build server-client systems for a specific collaborative ML use case, an approach that is neither scalable nor reusable. As the scale of collaborative ML grows, the need for a scalable, efficient, and ideally multi-tenant resource management system becomes more pressing. We propose a novel system, Propius, that can adapt to the heterogeneity of client machines, and efficiently manage and control the computation flow between ML jobs…
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Scientific Computing and Data Management
