Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks
Davit Buniatyan

TL;DR
This paper presents Hyper, a hybrid distributed cloud framework that enables scalable, cost-effective processing of large-scale deep learning tasks across multiple clouds and on-premise infrastructure, supporting CPU and GPU resources.
Contribution
Hyper introduces a unified, failure-tolerant distributed system for deep learning workloads that leverages multiple clouds and on-premise resources, reducing costs and increasing scalability.
Findings
Scalability demonstrated on 10,000 CPU cores and 300 GPU instances.
Supports large-scale pre-processing, training, hyperparameter search, and inference.
Achieves a processing power of 30 petaflops.
Abstract
Training and deploying deep learning models in real-world applications requires processing large amounts of data. This becomes challenging when the amount of data grows to a hundred terabytes or even petabyte scale. We introduce a hybrid distributed cloud framework with a unified view of multiple clouds and on-premise infrastructure for processing tasks at scale using both CPU and GPU compute instances. The system implements a distributed file system and a failure-tolerant task-processing scheduler, independent of the programming language and deep learning framework used. It makes it possible to use unstable, cheap cloud resources to significantly reduce costs. We demonstrate the scalability of the framework by running pre-processing, distributed training, hyperparameter search, and large-scale inference tasks utilizing 10,000 CPU cores and 300 GPU instances with the overall processing power of…
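The abstract does not describe how the failure-tolerant scheduler works internally, so the following is only a minimal sketch of the general idea, not Hyper's implementation. It assumes a task is re-queued whenever the worker running it is preempted, which is the usual failure mode of cheap, unstable cloud instances. All names here (`Preempted`, `run_with_retries`, `make_flaky_worker`) are hypothetical and chosen for illustration.

```python
import random
from collections import deque


class Preempted(Exception):
    """Hypothetical signal that the worker running a task was reclaimed."""


def run_with_retries(tasks, run_task, max_attempts=5):
    """Run every task once to completion, re-queueing tasks that were preempted.

    `run_task` is any callable that returns a result or raises Preempted.
    A task that is preempted `max_attempts` times re-raises the error.
    """
    queue = deque((task, 1) for task in tasks)
    results = {}
    while queue:
        task, attempt = queue.popleft()
        try:
            results[task] = run_task(task)
        except Preempted:
            if attempt >= max_attempts:
                raise
            # Put the task at the back of the queue so other work proceeds first.
            queue.append((task, attempt + 1))
    return results


def make_flaky_worker(failure_rate, seed=0):
    """Simulate an unstable instance that fails with probability `failure_rate`."""
    rng = random.Random(seed)

    def run_task(task):
        if rng.random() < failure_rate:
            raise Preempted(f"worker preempted while running task {task}")
        return task * task  # stand-in for real pre-processing or inference work

    return run_task


# Usage: all eight tasks finish even though roughly 30% of attempts fail.
results = run_with_retries(range(8), make_flaky_worker(0.3), max_attempts=50)
```

A real scheduler would also need to persist task state and results (for example on the distributed file system the abstract mentions) so that a failure of the scheduler itself does not lose progress; the sketch keeps everything in memory for brevity.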
