Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster
Ying Mao, Yuqi Fu, Wenjia Zheng, Long Cheng, Qingzhi Liu, and Dingwen Tao

TL;DR
This paper introduces SpeCon, a container scheduling algorithm optimized for deep learning workloads in Kubernetes, which improves training completion times by speculatively migrating slow models to enhance resource utilization.
Contribution
The paper proposes SpeCon, a novel speculative container scheduler tailored for deep learning applications, with algorithms to monitor training progress and migrate slow models, improving efficiency.
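As a rough illustration of the "monitor training progress, then migrate slow models" idea, the sketch below flags jobs whose loss has nearly stopped improving. It is not SpeCon's actual algorithm, which this summary does not spell out. The function names, the loss-based progress metric, the `window` size, and the `threshold` value are all assumptions made for this example.

```python
from typing import Dict, List


def recent_progress(losses: List[float], window: int = 3) -> float:
    """Average per-step loss decrease over the last `window` steps (hypothetical metric)."""
    if len(losses) < window + 1:
        # Too little history to judge; treat the job as still progressing.
        return float("inf")
    return (losses[-window - 1] - losses[-1]) / window


def pick_migration_candidates(history: Dict[str, List[float]],
                              threshold: float = 0.01,
                              window: int = 3) -> List[str]:
    """Return jobs whose recent progress fell below `threshold` (assumed cutoff)."""
    return sorted(job for job, losses in history.items()
                  if recent_progress(losses, window) < threshold)


# Made-up loss curves for three training containers.
history = {
    "job-a": [2.0, 1.5, 1.1, 0.8, 0.6],       # still improving quickly
    "job-b": [0.9, 0.52, 0.51, 0.505, 0.50],  # has plateaued
    "job-c": [1.2, 1.0],                      # too new to judge
}
candidates = pick_migration_candidates(history)
print(candidates)
```

In a real cluster, a scheduler could treat the returned jobs as candidates for relocation so that fast-progressing jobs keep more of a node's resources; how SpeCon makes that decision is described in the paper itself.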
Findings
SpeCon reduces individual job completion time by up to 41.5%.
SpeCon improves system-wide performance by 14.8%.
SpeCon decreases makespan by 24.7%.
Abstract
In the past decade, we have witnessed a dramatic increase in the volume of data collected from varied sources. The explosion of data has transformed the world as more information is available for collection and analysis than ever before. To make full use of it, various machine and deep learning models have been developed, e.g., CNN [1] and RNN [2], to study data and extract valuable information from different perspectives. While data-driven applications improve countless products, training models for hyperparameter tuning is still a time-consuming and resource-intensive process. Cloud computing provides infrastructure support for the training of deep learning applications. Cloud service providers, such as Amazon Web Services [3], create an isolated virtual environment (virtual machines and containers) for clients, who share physical resources, e.g., CPU and memory. On the cloud,…
