Exploiting Stragglers in Distributed Computing Systems with Task Grouping
Tharindu Adikari, Haider Al-Lawati, Jason Lam, Zhenhua Hu, Stark C. Draper

TL;DR
This paper proposes handling stragglers in distributed systems by using their partial work instead of discarding it. Assigning work at a finer granularity and having workers report updates more frequently reduces task completion times, as shown in experiments on a simulated cluster and on Amazon EC2 with Apache Hadoop.
Contribution
The paper proposes a method that uses the work completed by stragglers rather than discarding it, avoiding the resource wastage of work replication.
Findings
Reduces task completion time in simulated clusters.
Effective on Amazon EC2 with Apache Hadoop.
Outperforms traditional work replication methods.
Abstract
We consider the problem of stragglers in distributed computing systems. Stragglers, which are compute nodes that unpredictably slow down, often increase the completion times of tasks. One common approach to mitigating stragglers is work replication, where only the first completion among replicated tasks is accepted, discarding the others. However, discarding work leads to resource wastage. In this paper, we propose a method for exploiting the work completed by stragglers rather than discarding it. The idea is to increase the granularity of the assigned work, and to increase the frequency of worker updates. We show that the proposed method reduces the completion time of tasks via experiments performed on a simulated cluster as well as on Amazon EC2 with Apache Hadoop.
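The contrast drawn in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' scheme or their experimental setup: the worker speeds, the job size, the shared pull queue, and the function names `replication_time` and `fine_grained_time` are all assumptions made for illustration. It compares replication, where only the fastest replica of each task counts, with a fine-grained assignment in which workers report after every small unit so that a slow worker's finished units still count.

```python
import heapq

# Hypothetical worker speeds in work units per time unit; the last worker is a straggler.
SPEEDS = [1.0, 1.0, 1.0, 0.25]
TOTAL_WORK = 12.0


def replication_time(speeds, total_work, r):
    """Split the job evenly across groups of r workers.

    Each group's task completes when its fastest replica finishes;
    the slower replicas' work is discarded.
    """
    groups = [speeds[i:i + r] for i in range(0, len(speeds), r)]
    task_size = total_work / len(groups)
    return max(task_size / max(group) for group in groups)


def fine_grained_time(speeds, total_work, unit):
    """Workers pull small units from a shared queue and report after each one.

    Every completed unit counts, including those finished by a straggler.
    """
    remaining = round(total_work / unit)
    events = []  # min-heap of (finish_time, worker_index)
    for worker, speed in enumerate(speeds):
        if remaining > 0:
            heapq.heappush(events, (unit / speed, worker))
            remaining -= 1
    finish = 0.0
    while events:
        finish, worker = heapq.heappop(events)
        if remaining > 0:
            # The worker reports its unit and immediately pulls the next one.
            heapq.heappush(events, (finish + unit / speeds[worker], worker))
            remaining -= 1
    return finish


print("replication (r=2):", replication_time(SPEEDS, TOTAL_WORK, 2))
print("fine-grained (unit=1):", fine_grained_time(SPEEDS, TOTAL_WORK, 1.0))
```

With these made-up numbers the replicated run takes 6.0 time units and the fine-grained run takes 4.0. The gap comes from the straggler's completed unit being counted and from the fast workers never waiting on a replica. A real system would also pay a communication cost for each update, which this sketch ignores.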
Taxonomy
Topics: Data Stream Mining Techniques · Cloud Computing and Resource Management · IoT and Edge/Fog Computing
