# Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach

**Authors:** Peng Sun, Yonggang Wen, Ta Nguyen Binh Duong, Shengen Yan

arXiv: 1704.06738 · 2017-06-19

## TL;DR

This paper introduces Dorm, a cluster management system that dynamically partitions a shared cluster to improve resource utilization, fairness, and performance for distributed machine learning workloads, with sharing overhead below 5% in most cases.

## Contribution

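Dorm is a cluster management system that pairs a dynamically-partitioned cluster management mechanism with a utilization-fairness optimizer. It runs one application per container-based partition and resizes partitions at application runtime to balance resource efficiency and fairness.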

## Key findings

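- Resource utilization increased by a factor of up to 2.32 compared to existing approaches.
- Fairness loss reduced by a factor of up to 1.52.
- Popular distributed ML applications sped up by a factor of up to 2.72.
- Sharing overhead stayed below 5% in most cases.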

## Abstract

Many cluster management systems (CMSs) have been proposed to share a single cluster among multiple distributed computing systems. However, none of the existing approaches can handle distributed machine learning (ML) workloads given the following criteria: high resource utilization, fair resource allocation, and low sharing overhead. To solve this problem, we propose a new CMS named Dorm, incorporating a dynamically-partitioned cluster management mechanism and a utilization-fairness optimizer. Specifically, Dorm uses the container-based virtualization technique to partition a cluster, runs one application per partition, and can dynamically resize each partition at application runtime for resource efficiency and fairness. Each application directly launches its tasks on the assigned partition without petitioning for resources frequently, so Dorm imposes flat sharing overhead. Extensive performance evaluations showed that Dorm could simultaneously increase the resource utilization by a factor of up to 2.32, reduce the fairness loss by a factor of up to 1.52, and speed up popular distributed ML applications by a factor of up to 2.72, compared to existing approaches. Dorm's sharing overhead is less than 5% in most cases.
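The idea of resizing one partition per application at runtime can be illustrated with a toy sketch. This is not Dorm's utilization-fairness optimizer, which this summary does not describe. It swaps in plain max-min fair sharing (water-filling) over a single resource, and the function name `resize_partitions`, the container counts, and the application names are all hypothetical.

```python
from typing import Dict


def resize_partitions(capacity: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Split `capacity` containers among applications by max-min fairness.

    Assumption: one resource type and integer container counts. No
    application receives more than its demand; capacity left over by
    small applications is redistributed among those that still want more.
    """
    alloc = {app: 0 for app in demands}
    remaining = capacity
    active = {app for app, demand in demands.items() if demand > 0}
    while remaining > 0 and active:
        # Equal share of what is left, at least one container per round.
        share = max(remaining // len(active), 1)
        for app in sorted(active):  # sorted for deterministic tie-breaking
            if remaining == 0:
                break
            grant = min(share, demands[app] - alloc[app], remaining)
            alloc[app] += grant
            remaining -= grant
        # Applications whose demand is met drop out of later rounds.
        active = {app for app in active if alloc[app] < demands[app]}
    return alloc


if __name__ == "__main__":
    cluster_capacity = 12  # hypothetical number of containers
    running = {"job_a": 10}
    print(resize_partitions(cluster_capacity, running))

    # A second application arrives: the existing partition shrinks.
    running["job_b"] = 10
    print(resize_partitions(cluster_capacity, running))

    # A small third application arrives: it gets its full demand and
    # the larger applications split the rest.
    running["job_c"] = 2
    print(resize_partitions(cluster_capacity, running))
```

Recomputing the allocation only when an application arrives or finishes mirrors the abstract's point that applications launch tasks inside their assigned partition instead of requesting resources per task. A real system would also have to weigh the cost of resizing a running application, which this sketch ignores.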

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.06738/full.md

## Figures

21 figures with captions in the complete paper: https://tomesphere.com/paper/1704.06738/full.md

## References

18 references — full list in the complete paper: https://tomesphere.com/paper/1704.06738/full.md

---
Source: https://tomesphere.com/paper/1704.06738