Fault Tolerance in Distributed Neural Computing
Anton Kulakov, Mark Zwolinski, Jeff Reeve

TL;DR
This paper investigates the fault tolerance of a distributed feed-forward neural network with decentralised event-driven time management, analysing its robustness to intermittent hardware and communication faults during both learning and operation.
Contribution
It demonstrates that distributed neural networks with local learning rules can maintain functionality despite hardware and communication faults, offering insights into scalable fault-tolerant systems.
Findings
Neural networks exhibit intrinsic fault-tolerance during learning and operation.
Fault injection increases overhead but does not compromise overall system performance.
Distributed, local-rule-based networks are resilient to hardware and communication failures.
Abstract
With the increasing complexity of computing systems, complete hardware reliability can no longer be guaranteed. We need, however, to ensure overall system reliability. One of the most important features of artificial neural networks is their intrinsic fault-tolerance. The aim of this work is to investigate whether such networks have features that can be applied to wider computational systems. This paper presents an analysis, in both the learning and operational phases, of a distributed feed-forward neural network with decentralised event-driven time management, which is insensitive to intermittent faults caused by unreliable communication or faulty hardware components. The learning rules used in the model are local in space and time, which allows efficient scalable distributed implementation. We investigate the overhead caused by injected faults and analyse the sensitivity to limited…
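The idea of a local learning rule operating under intermittent communication faults can be illustrated with a toy sketch. This is not the model from the paper: the task (logical OR), the perceptron-style error-correction rule, the fault model (each input message is lost independently with probability `drop_prob` and then read as 0), and all names (`PATTERNS`, `train`, `accuracy`) are assumptions chosen for illustration only.

```python
import random

# Toy task (an assumption for illustration): logical OR of two inputs.
PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def step(total):
    """Threshold activation: fire only when the weighted sum is positive."""
    return 1 if total > 0 else 0


def train(drop_prob, epochs=50, seed=0):
    """Train one threshold neuron with a local error-correction rule.

    Each input reaches the neuron as a separate message. Under the assumed
    fault model, a message is lost with probability drop_prob and the neuron
    then reads that input as 0.
    Returns (weights, bias, number_of_dropped_messages).
    """
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    bias = 0.0
    dropped = 0
    for _ in range(epochs):
        for inputs, target in PATTERNS:
            received = []
            for x in inputs:
                if rng.random() < drop_prob:
                    dropped += 1
                    received.append(0)  # lost message is read as "no signal"
                else:
                    received.append(x)
            total = sum(w * x for w, x in zip(weights, received)) + bias
            error = target - step(total)
            # Local update: each weight changes using only the input it
            # actually received and this neuron's own error.
            for i, x in enumerate(received):
                weights[i] += error * x
            bias += error
    return weights, bias, dropped


def accuracy(weights, bias):
    """Fraction of patterns classified correctly with fault-free inputs."""
    correct = 0
    for inputs, target in PATTERNS:
        total = sum(w * x for w, x in zip(weights, inputs)) + bias
        correct += int(step(total) == target)
    return correct / len(PATTERNS)


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3):
        w, b, lost = train(p)
        print(f"drop_prob={p} dropped={lost} accuracy={accuracy(w, b)}")
```

Because a lost input is read as 0, the corresponding weight is simply not updated for that event, so no global coordination is needed to handle the fault. How well training holds up as `drop_prob` grows depends on the task and the rule; the sketch makes no claim about the results reported in the paper.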
