SPIRT: A Fault-Tolerant and Reliable Peer-to-Peer Serverless ML Training Architecture

Amine Barrak; Mayssa Jaziri; Ranim Trabelsi; Fehmi Jaafar; Fabio Petrillo

arXiv:2309.14148·cs.DC·September 26, 2023·1 cites

Open Access

TL;DR

SPIRT introduces a fault-tolerant, secure, and scalable serverless P2P architecture for distributed machine learning, significantly reducing update times and maintaining high accuracy under peer failures and attacks.

Contribution

This paper presents SPIRT, the first fault-tolerant, secure, and scalable serverless P2P ML training architecture utilizing RedisAI for efficiency and robustness.

Findings

01. 82% reduction in the time required for model updates and gradient averaging
02. Resilience against peer failures and Byzantine attacks
03. Effective integration of new peers into the P2P network

Abstract

The advent of serverless computing has ushered in notable advancements in distributed machine learning, particularly within parameter server-based architectures. Yet the integration of serverless features within peer-to-peer (P2P) distributed networks remains largely uncharted. In this paper, we introduce SPIRT, a fault-tolerant, reliable, and secure serverless P2P ML training architecture designed to bridge this gap. Capitalizing on the robustness and reliability inherent to P2P systems, SPIRT employs RedisAI for in-database operations, leading to an 82% reduction in the time required for model updates and gradient averaging across a variety of models and batch sizes. The architecture is resilient to peer failures and adeptly manages the integration of new peers, demonstrating its fault tolerance and scalability. Furthermore,…
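The gradient averaging step mentioned in the abstract can be illustrated with a minimal sketch. This is not SPIRT's implementation and does not use RedisAI; it is plain Python that only shows the arithmetic of averaging gradients over the peers that responded and applying one update. The function names `average_gradients` and `apply_update`, the peer identifiers, and the convention that a failed peer is represented by `None` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

Gradient = List[float]


def average_gradients(peer_gradients: Dict[str, Optional[Gradient]]) -> Gradient:
    """Average gradients element-wise over peers that responded.

    A value of None stands for a peer that failed to report (an
    illustrative convention, not taken from the paper).
    """
    available = [g for g in peer_gradients.values() if g is not None]
    if not available:
        raise ValueError("no peer gradients available")
    size = len(available[0])
    if any(len(g) != size for g in available):
        raise ValueError("gradient sizes differ between peers")
    # zip(*available) groups the i-th component of every peer's gradient.
    return [sum(values) / len(available) for values in zip(*available)]


def apply_update(weights: List[float], avg_grad: Gradient, lr: float) -> List[float]:
    """Apply one plain gradient-descent step with learning rate lr."""
    return [w - lr * g for w, g in zip(weights, avg_grad)]


if __name__ == "__main__":
    grads = {"peer_a": [1.0, 2.0], "peer_b": [3.0, 4.0], "peer_c": None}
    avg = average_gradients(grads)
    print(avg)
    print(apply_update([1.0, 1.0], avg, lr=0.5))
```

In this sketch a missing peer simply drops out of the average, so the remaining peers can still complete the step. How SPIRT itself detects failures and defends against Byzantine peers is not described in the text above and is not modeled here.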

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Adversarial Robustness in Machine Learning