Technical Report: Estimating Reliability of Workers for Cooperative Distributed Computing

Seda Davtyan; Kishori M. Konwar; Alexander A. Shvartsman

arXiv:1407.0696·cs.DC·July 4, 2014

TL;DR

This paper introduces a decentralized, randomized algorithm that estimates the reliability of individual processors in internet supercomputing, enabling better fault tolerance without global coordination.

Contribution

It presents a novel algorithm for estimating each processor's probability of correctness in a distributed setting, applicable under adversarial conditions and without requiring global synchronization.

Findings

1. Estimates are accurate within specified bounds with high probability.

2. Algorithm operates efficiently with logarithmic time complexity in the number of processors.

3. Works effectively even when some processors crash or behave adversarially.

Abstract

Internet supercomputing is an approach to solving partitionable, computation-intensive problems by harnessing the power of a vast number of interconnected computers. For the problem of using network supercomputing to perform a large collection of independent tasks, prior work introduced a decentralized approach and provided randomized synchronous algorithms that perform all tasks correctly with high probability, while dealing with misbehaving or crash-prone processors. The main weakness of existing algorithms is that they assume either that the \emph{average} probability of a non-crashed processor returning incorrect results is less than $\frac{1}{2}$, or that the probability of returning incorrect results is known to \emph{each} processor. Here we present a randomized synchronous distributed algorithm that tightly estimates the probability of each processor returning correct…
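To make the idea of estimating a worker's probability of returning correct results concrete, here is a minimal sketch. It is not the paper's algorithm: it is a centralized simulation that assumes each result can be checked against a known answer, which the decentralized setting above does not provide. The names `samples_needed`, `estimate_reliability`, and `make_worker` are hypothetical, and the sample size comes from the standard Hoeffding inequality rather than from any bound stated in the paper.

```python
import math
import random


def samples_needed(epsilon: float, delta: float) -> int:
    """Hoeffding bound: number of independent trials after which the
    empirical frequency is within epsilon of the true probability
    with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_reliability(worker, n_trials: int) -> float:
    """Fraction of n_trials in which worker() returned a correct result.
    Assumption: correctness of each result can be verified directly."""
    correct = sum(1 for _ in range(n_trials) if worker())
    return correct / n_trials


def make_worker(p_correct: float, rng: random.Random):
    """Hypothetical worker that is correct with probability p_correct."""
    return lambda: rng.random() < p_correct


rng = random.Random(0)  # fixed seed so the simulation is repeatable
n = samples_needed(epsilon=0.05, delta=0.01)
true_p = [0.9, 0.6, 0.3]
estimates = [estimate_reliability(make_worker(p, rng), n) for p in true_p]

for p, est in zip(true_p, estimates):
    print(f"true p = {p:.2f}, estimate = {est:.3f} from {n} trials")
```

The sketch only illustrates what "accurate within specified bounds with high probability" means for a single worker observed in isolation; the distributed algorithm in the paper has to obtain comparable guarantees without a trusted checker.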

Taxonomy

Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Scheduling and Optimization Algorithms