High-Performance Distributed ML at Scale through Parameter Server Consistency Models

Wei Dai; Abhimanu Kumar; Jinliang Wei; Qirong Ho; Garth Gibson; Eric P. Xing

arXiv:1410.8043 · cs.LG · October 31, 2014 · 72 cites

TL;DR

This paper analyzes the theoretical and empirical aspects of consistency models in distributed Parameter Server frameworks for Machine Learning, proposing improvements to accelerate convergence and ensure correctness.

Contribution

It provides a theoretical study of consistency models in PS systems, introduces an improved eager communication mechanism, and implements a new system to enhance ML training speed.

Findings

01. Certain consistency models guarantee correct ML outputs.

02. The eager communication mechanism improves convergence speed.

03. The new PS system outperforms existing frameworks in training time.

Abstract

As Machine Learning (ML) applications increase in data size and model complexity, practitioners turn to distributed clusters to satisfy the increased computational and memory demands. Unfortunately, effective use of clusters for ML requires considerable expertise in writing distributed code, while highly-abstracted frameworks like Hadoop have not, in practice, approached the performance seen in specialized ML implementations. The recent Parameter Server (PS) paradigm is a middle ground between these extremes, allowing easy conversion of single-machine parallel ML applications into distributed ones, while maintaining high throughput through relaxed "consistency models" that allow inconsistent parameter reads. However, due to insufficient theoretical study, it is not clear which of these consistency models can really ensure correct ML algorithm output; at the same time, there remain many…
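To make the idea of a relaxed consistency model concrete, here is a minimal single-process sketch of a bounded-staleness read rule, one common way a parameter server can allow slightly out-of-date reads. It is an illustration only and not the paper's system or API: the class `ToyParameterServer`, its methods, and the `staleness` parameter are hypothetical names, and the abstract does not say that this particular rule is among the models analyzed.

```python
from collections import defaultdict


class ToyParameterServer:
    """Toy in-memory parameter store with a bounded-staleness read rule.

    Assumption (not from the paper): a worker may read only while it is
    at most `staleness` clock ticks ahead of the slowest worker.
    """

    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        self.clocks = [0] * num_workers      # per-worker iteration counters
        self.params = defaultdict(float)     # shared model parameters

    def push(self, key, delta):
        # Updates are additive, so they can be applied in any order.
        self.params[key] += delta

    def tick(self, worker):
        # A worker signals that it has finished one iteration.
        self.clocks[worker] += 1

    def can_read(self, worker):
        # Relaxed consistency: tolerate a gap of up to `staleness` ticks.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def read(self, worker, key):
        if not self.can_read(worker):
            raise RuntimeError("worker is too far ahead and must wait")
        return self.params[key]


if __name__ == "__main__":
    ps = ToyParameterServer(num_workers=2, staleness=1)
    ps.push("w", 0.5)
    ps.tick(0)
    print(ps.can_read(0))   # worker 0 is one tick ahead: still allowed
    ps.tick(0)
    print(ps.can_read(0))   # two ticks ahead: must wait for worker 1
```

Setting `staleness` to 0 in this sketch forces workers to stay in lockstep, while larger values trade read freshness for fewer stalls, which is the throughput versus correctness tension the abstract describes.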

Taxonomy

Topics: Cloud Computing and Resource Management · Stochastic Gradient Optimization Techniques · IoT and Edge/Fog Computing