# Coded TeraSort

**Authors:** Songze Li, Sucha Supittayapornpong, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr

arXiv: 1702.04850 · 2017-02-17

## TL;DR

Coded TeraSort is a distributed sorting algorithm that imposes structured redundancy on the data to enable in-network coding, cutting the execution time of the TeraSort benchmark in Hadoop MapReduce and achieving a 1.97x to 3.39x speedup on Amazon EC2 clusters.

## Contribution

The paper proposes Coded TeraSort, a distributed sorting algorithm that imposes structured redundancy in the data to create in-network coding opportunities, which in turn overcome the data shuffling bottleneck of conventional TeraSort.

## Key findings

- Achieves a 1.97x to 3.39x speedup over TeraSort on Amazon EC2 clusters, for typical settings of interest.
- Overcomes the data shuffling bottleneck of TeraSort in Hadoop MapReduce.
- Demonstrates practical benefits of in-network coding in distributed sorting.

## Abstract

We focus on sorting, which is the building block of many machine learning algorithms, and propose a novel distributed sorting algorithm, named Coded TeraSort, which substantially improves the execution time of the TeraSort benchmark in Hadoop MapReduce. The key idea of Coded TeraSort is to impose structured redundancy in data, in order to enable in-network coding opportunities that overcome the data shuffling bottleneck of TeraSort. We empirically evaluate the performance of the Coded TeraSort algorithm on Amazon EC2 clusters, and demonstrate that it achieves a 1.97x to 3.39x speedup, compared with TeraSort, for typical settings of interest.
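The key idea in the abstract, redundant placement that creates coding opportunities during the shuffle, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes three nodes, three input files each stored on two nodes, one reduce partition per node, and a broadcast medium in which one transmitted packet reaches both other nodes. The names `placement`, `map_value`, `segment`, `encode`, and `decode` are hypothetical, and `map_value` is a stand-in for real Map output.

```python
K = 3      # number of nodes; node k also reduces partition k (assumption)
HALF = 4   # each intermediate value is 8 bytes, split into two 4-byte halves

# Structured redundancy: every file is stored on exactly two of the three nodes.
placement = {"A": {0, 1}, "B": {0, 2}, "C": {1, 2}}


def map_value(partition: int, file: str) -> bytes:
    """Hypothetical stand-in for the Map output of `file` destined for `partition`."""
    return (f"{file}{partition}" * 4).encode()  # always 8 bytes


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def missing_file(node: int) -> str:
    """The single file that `node` does not store, so it must be shuffled in."""
    return next(f for f, holders in placement.items() if node not in holders)


def segment(target: int, sender: int) -> bytes:
    """The half of the value needed by `target` that `sender` is responsible for.

    Any node other than `target` stores the file involved, so it can compute
    this half locally from its own Map output.
    """
    f = missing_file(target)
    idx = sorted(placement[f]).index(sender)
    return map_value(target, f)[idx * HALF:(idx + 1) * HALF]


def encode(sender: int) -> bytes:
    """One coded packet: XOR of the halves the two other nodes are waiting for."""
    a, b = [k for k in range(K) if k != sender]
    return xor(segment(a, sender), segment(b, sender))


def decode(receiver: int, sender: int, packet: bytes) -> bytes:
    """Cancel the half meant for the third node, which the receiver already knows."""
    other = next(k for k in range(K) if k not in (receiver, sender))
    assert receiver in placement[missing_file(other)]  # receiver stores that file
    return xor(packet, segment(other, sender))


# Shuffle phase: every node broadcasts exactly one coded packet.
packets = {s: encode(s) for s in range(K)}

# Each node rebuilds its missing value from the two packets it receives.
recovered = {}
for k in range(K):
    senders = sorted(placement[missing_file(k)])
    recovered[k] = b"".join(decode(k, s, packets[s]) for s in senders)

coded_bytes = sum(len(p) for p in packets.values())
uncoded_bytes = sum(len(map_value(k, missing_file(k))) for k in range(K))
print(recovered, coded_bytes, uncoded_bytes)
```

In this toy setting each node recovers its full missing value while the coded shuffle transmits half as many bytes as sending the three values uncoded. The price is that every file is mapped twice, which is the redundancy the abstract refers to. The speedups reported in the paper come from measurements on EC2, not from this byte count.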

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1702.04850/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1702.04850/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/1702.04850/full.md

---
Source: https://tomesphere.com/paper/1702.04850