Saturn: An Optimized Data System for Large Model Deep Learning Workloads

Kabir Nagrecha; Arun Kumar

arXiv:2309.01226·cs.LG·December 14, 2023

TL;DR

Saturn is a new data system that automates and optimizes the selection of parallelism strategies, resource allocation, and scheduling for large deep learning models, significantly reducing training time.

Contribution

The paper introduces Saturn, a data system that jointly addresses parallelism, resource allocation, and scheduling for large models using an MILP formulation and empirical profiling.

Findings

01. Saturn reduces model selection runtimes by 39-49%.

02. The MILP-based approach outperforms baseline heuristics.

03. An extensible template and empirical profiler enhance system effectiveness.

Abstract

Large language models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques & tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists, domain scientists, etc. who may lack the necessary systems knowhow. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner…

Code & Models

Repositories

knagrecha/saturn (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Machine Learning and Data Classification

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Softmax · Layer Normalization · Linear Layer · Dense Connections · Attention Dropout