Garfield: System Support for Byzantine Machine Learning
Rachid Guerraoui, Arsany Guirguis, Jérémy Max Plassmann, Anton Alexandre Ragot, Sébastien Rouault

TL;DR
Garfield is a library that makes machine learning Byzantine-resilient across several architectures and communication patterns with reduced coding effort. The authors use it to measure the practical cost of Byzantine resilience in ML applications.
Contribution
The paper introduces Garfield, a novel object-oriented library that simplifies implementing Byzantine-resilient ML across different architectures and hardware, and uses it for a detailed analysis of the cost of Byzantine resilience.
Findings
Byzantine resilience, unlike crash resilience, causes accuracy loss.
Communication overhead exceeds robust aggregation costs.
Tolerating Byzantine servers is more costly than Byzantine workers.
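To make "robust aggregation" concrete, here is a minimal sketch in plain Python. It is not Garfield's API, and the text above does not say which aggregation rules the library uses. Coordinate-wise median is one well-known rule from the Byzantine ML literature, chosen here only for illustration. The function names and the toy gradients are hypothetical.

```python
from statistics import median

def average(gradients):
    """Plain averaging: one arbitrary gradient can shift the result without bound."""
    n = len(gradients)
    return [sum(coord) / n for coord in zip(*gradients)]

def coordinate_wise_median(gradients):
    """Robust aggregation sketch: take the median of each coordinate across workers."""
    return [median(coord) for coord in zip(*gradients)]

# Toy example (made-up values): four honest workers and one Byzantine worker.
honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]]
byzantine = [[1e6, -1e6]]  # arbitrary values sent by the faulty worker
grads = honest + byzantine

avg = average(grads)                 # dominated by the Byzantine gradient
med = coordinate_wise_median(grads)  # stays within the range of honest values

print("average:", avg)
print("median: ", med)
```

In this toy setting the average is pulled far away from the honest gradients, while the median remains among them. The median discards information that averaging would keep, which is consistent with the accuracy-loss finding above.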
Abstract
We present Garfield, a library to transparently make machine learning (ML) applications, initially built with popular (but fragile) frameworks, e.g., TensorFlow and PyTorch, Byzantine-resilient. Garfield relies on a novel object-oriented design, reducing the coding effort, and addressing the vulnerability of the shared-graph architecture followed by classical ML frameworks. Garfield encompasses various communication patterns and supports computations on CPUs and GPUs, allowing addressing the general question of the very practical cost of Byzantine resilience in SGD-based ML applications. We report on the usage of Garfield on three main ML architectures: (a) a single server with multiple workers, (b) several servers and workers, and (c) peer-to-peer settings. Using Garfield, we highlight several interesting facts about the cost of Byzantine resilience. In particular, (a) Byzantine…
Taxonomy
Topics: Distributed systems and fault tolerance · Cryptography and Data Security · Cloud Data Security Solutions
