Technical Report: Estimating Reliability of Workers for Cooperative Distributed Computing
Seda Davtyan, Kishori M. Konwar, Alexander A. Shvartsman

TL;DR
This paper introduces a decentralized, randomized algorithm that estimates the reliability of individual processors in internet supercomputing, enabling better fault tolerance without global coordination.
Contribution
It presents a novel synchronous algorithm for estimating each processor's probability of returning correct results in a decentralized setting, applicable when processors may misbehave or crash and without requiring a central coordinator.
Findings
Estimates are accurate within specified bounds with high probability.
The algorithm runs in time logarithmic in the number of processors.
It works even when some processors crash or behave adversarially.
Abstract
Internet supercomputing is an approach to solving partitionable, computation-intensive problems by harnessing the power of a vast number of interconnected computers. For the problem of using network supercomputing to perform a large collection of independent tasks, prior work introduced a decentralized approach and provided randomized synchronous algorithms that perform all tasks correctly with high probability, while dealing with misbehaving or crash-prone processors. The main weakness of existing algorithms is that they assume either that the \emph{average} probability of a non-crashed processor returning incorrect results is less than $\frac{1}{2}$, or that the probability of returning incorrect results is known to \emph{each} processor. Here we present a randomized synchronous distributed algorithm that tightly estimates the probability of each processor returning correct…
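To make the idea of estimating a worker's probability of correctness concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it is a centralized, single-worker illustration that assumes test tasks with known answers are available, and it uses the standard Hoeffding bound to choose the number of trials. The names `samples_needed`, `estimate_reliability`, and `make_worker` are hypothetical and introduced only for this example.

```python
import math
import random


def samples_needed(eps: float, delta: float) -> int:
    """Trials needed so the empirical success rate is within eps of the
    true probability with probability at least 1 - delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_reliability(worker, n_trials: int) -> float:
    """Fraction of n_trials test tasks that the worker answers correctly.

    `worker` is a hypothetical callable returning True when the worker's
    result matches the known answer of a test task.
    """
    correct = sum(1 for _ in range(n_trials) if worker())
    return correct / n_trials


def make_worker(p_correct: float, rng: random.Random):
    """Simulated worker (an assumption for this sketch): each call is
    correct independently with probability p_correct."""
    return lambda: rng.random() < p_correct


if __name__ == "__main__":
    eps, delta = 0.05, 0.01
    n = samples_needed(eps, delta)
    worker = make_worker(0.8, random.Random(0))
    print(n, estimate_reliability(worker, n))
```

The trial count grows as 1/eps² but only logarithmically in 1/delta, so tightening the confidence level is much cheaper than tightening the accuracy. A decentralized algorithm additionally has to gather such observations without a trusted party and in the presence of crashes, which this sketch does not attempt.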
Taxonomy
Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Scheduling and Optimization Algorithms
