Hydra: A System for Large Multi-Model Deep Learning
Kabir Nagrecha, Arun Kumar

TL;DR
Hydra is a system that enables efficient large multi-model deep learning on commodity GPUs by optimizing execution and resource management, significantly improving training throughput over existing frameworks.
Contribution
Hydra introduces a holistic approach to optimize multi-model deep learning workloads, combining model-parallel execution with scalable parameter offloading and task-parallel scheduling.
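As a rough illustration of how model-parallel sharding and task-parallel scheduling can interact, the toy scheduler below interleaves shards from several models across a pool of devices. It is a minimal sketch under stated assumptions, not Hydra's actual algorithm: the function name `schedule_shards`, the one-time-unit-per-shard cost, and the greedy round-robin policy are all hypothetical, and parameter offloading is only implied (a shard is assumed to sit in host memory until the step in which it runs).

```python
from collections import deque


def schedule_shards(shards_per_model, num_devices):
    """Toy greedy scheduler (hypothetical, for illustration only).

    shards_per_model[i] is the number of sequential shards of model i.
    Shards of one model run strictly in order (model parallelism), while
    shards of different models may run in the same step on different
    devices (task parallelism). Every shard is assumed to take one step.

    Returns a timeline: one list per step, holding the (model, shard)
    pairs that run in that step, one per busy device.
    """
    # Each model appears at most once in the queue, tagged with its next shard.
    pending = deque(
        (model, 0) for model, n in enumerate(shards_per_model) if n > 0
    )
    timeline = []
    while pending:
        # Fill as many devices as there are runnable models.
        step = [pending.popleft() for _ in range(min(num_devices, len(pending)))]
        # Re-enqueue models that still have shards left, at the back of the queue.
        for model, shard in step:
            if shard + 1 < shards_per_model[model]:
                pending.append((model, shard + 1))
        timeline.append(step)
    return timeline


if __name__ == "__main__":
    # Two models with two shards each, two devices.
    for t, step in enumerate(schedule_shards([2, 2], 2)):
        print(t, step)
```

With two devices both models advance in lockstep, so the workload finishes in two steps; with a single device the same shards are serialized into four steps. A real system would also have to account for unequal shard runtimes and host-to-device transfer cost, which this sketch ignores.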
Findings
Hydra achieves 50-100% higher training throughput than DeepSpeed and GPipe.
It enables training of 6-billion parameter models on a single commodity GPU.
Hydra demonstrates near-linear scaling in multi-GPU setups.
Abstract
Scaling up model depth and size is now a common approach to raise accuracy in many deep learning (DL) applications, as evidenced by the widespread success of multi-billion or even trillion parameter models in natural language processing (NLP) research. Despite success in DL research and at major technology companies, broader practical adoption of such large models among domain scientists and businesses is still bottlenecked by GPU memory limits, high training costs, and low GPU availability, even on public clouds. Model selection needs further compound these resource challenges: users often need to compare dozens of models with different hyper-parameters or neural architectures to suit their specific task and dataset. In this paper, we present Hydra, a system designed to tackle such challenges by enabling out-of-the-box scaling for multi-large-model DL workloads on even commodity GPUs…
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Machine Learning and Data Classification
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Residual Connection · Dropout · Dense Connections · GPipe · Discriminative Fine-Tuning · Multi-Head Attention · Weight Decay
