Elixir: Train a Large Language Model on a Small GPU Cluster

Haichen Huang; Jiarui Fang; Hongxin Liu; Shenggui Li; Yang You

arXiv:2212.05339 · cs.DC · June 1, 2023 · 1 citation

Open Access 2 Repos

TL;DR

Elixir is an automated system that optimizes large language model training on small GPU clusters by intelligently combining partitioning and offloading techniques, significantly improving training speed without requiring expert tuning.

Contribution

Elixir introduces an automated, profiling-based approach to optimize large-model training on limited hardware, surpassing existing manual tuning methods.

Findings

01. Achieves up to 3.4× speedup on GPT-2 models

02. Outperforms current state-of-the-art baseline methods

03. Automates the optimization process for resource-constrained training

Abstract

In recent years, large language models have achieved great success due to their unprecedented size. However, training these models poses a challenge for most researchers because it requires a substantial number of GPUs. To reduce GPU memory usage, memory partitioning and memory offloading have been proposed. These approaches eliminate memory redundancies and offload memory usage to CPU and NVMe memory, respectively, enabling training on small GPU clusters. However, directly deploying these solutions often leads to suboptimal efficiency: only experienced experts can unleash the full potential of the hardware by carefully tuning the distributed configuration. Thus, we present a novel solution, Elixir, which automates efficient large-model training based on pre-runtime model profiling. Elixir aims to identify the optimal combination of partitioning and offloading techniques to maximize…
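To make the trade-off between partitioning and offloading concrete, here is a minimal back-of-envelope sketch. It is not Elixir's algorithm or API. It assumes the common mixed-precision Adam accounting of 16 bytes per parameter (fp16 parameters and gradients, plus fp32 master weights and two Adam states), ignores activations, and uses hypothetical names (`Plan`, `gpu_bytes_per_device`, `cheapest_fitting_plan`). It also assumes that plans using less partitioning and offloading incur less communication overhead, so it tries them first.

```python
from dataclasses import dataclass

# Bytes per parameter under mixed-precision Adam (an assumed, commonly used accounting).
FP16_PARAM = 2
FP16_GRAD = 2
OPTIM_STATE = 12  # fp32 master weights (4) + Adam momentum (4) + Adam variance (4)


@dataclass
class Plan:
    """Hypothetical description of one distributed configuration."""
    n_gpus: int
    partition: bool       # shard model states evenly across GPUs
    offload_optim: bool   # keep optimizer states in CPU memory instead of on the GPU


def gpu_bytes_per_device(n_params: int, plan: Plan) -> float:
    """Estimate model-state bytes resident on each GPU (activations ignored)."""
    shard = plan.n_gpus if plan.partition else 1
    per_param = FP16_PARAM + FP16_GRAD
    if not plan.offload_optim:
        per_param += OPTIM_STATE
    return n_params * per_param / shard


def cheapest_fitting_plan(n_params: int, n_gpus: int, gpu_capacity_bytes: float):
    """Try plans from least to most memory-saving; return the first that fits, else None."""
    candidates = [
        Plan(n_gpus, partition=False, offload_optim=False),
        Plan(n_gpus, partition=True, offload_optim=False),
        Plan(n_gpus, partition=True, offload_optim=True),
    ]
    for plan in candidates:
        if gpu_bytes_per_device(n_params, plan) <= gpu_capacity_bytes:
            return plan
    return None


if __name__ == "__main__":
    # A 1B-parameter model on 4 GPUs with 8 GB available for model states each.
    print(cheapest_fitting_plan(1_000_000_000, 4, 8e9))
```

A real system would replace the fixed candidate list with a search over many configurations and would use profiled memory and bandwidth figures rather than a fixed bytes-per-parameter constant.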

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Layer Normalization · Byte Pair Encoding · Softmax · Linear Warmup With Cosine Annealing · Adam