Elixir: Train a Large Language Model on a Small GPU Cluster
Haichen Huang, Jiarui Fang, Hongxin Liu, Shenggui Li, Yang You

TL;DR
Elixir is an automated system that optimizes large language model training on small GPU clusters by intelligently combining partitioning and offloading techniques, significantly improving training speed without requiring expert tuning.
Contribution
Elixir introduces an automated, profiling-based approach to optimize large-model training on limited hardware, surpassing existing manual tuning methods.
Findings
Achieves up to 3.4× speedup on GPT-2 models
Outperforms current state-of-the-art baseline methods
Automates the optimization process for resource-constrained training
Abstract
In recent years, large language models have achieved great success due to their unprecedented size. However, training these models poses a challenge for most researchers as it requires a substantial number of GPUs. To reduce GPU memory usage, memory partitioning and memory offloading have been proposed. These approaches eliminate memory redundancies and offload memory usage to the CPU and NVMe memory, respectively, enabling training on small GPU clusters. However, directly deploying these solutions often leads to suboptimal efficiency. Only experienced experts can unleash the full potential of hardware by carefully tuning the distributed configuration. Thus, we present a novel solution, Elixir, which automates efficient large-model training based on pre-runtime model profiling. Elixir aims to identify the optimal combination of partitioning and offloading techniques to maximize…
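As a rough illustration of why partitioning and offloading matter, the sketch below estimates the per-GPU memory taken by model states alone. It is not Elixir's profiler or search procedure. The function name and its parameters are hypothetical, and the accounting of 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 Adam states) is a commonly used assumption for mixed-precision Adam training, not a figure taken from this abstract. Activations and temporary buffers are ignored.

```python
def model_state_bytes_per_gpu(num_params, num_gpus=1,
                              partition=False, offload_optimizer=False):
    """Estimate per-GPU bytes used by model states (hypothetical helper).

    Assumed accounting for mixed-precision Adam, per parameter:
      2 B fp16 weight + 2 B fp16 gradient
      12 B fp32 optimizer states (master weight, momentum, variance)
    """
    fp16_bytes = 4 * num_params       # fp16 weights and gradients
    optim_bytes = 12 * num_params     # fp32 optimizer states

    if offload_optimizer:
        # Offloading: optimizer states live in CPU or NVMe memory instead.
        optim_bytes = 0

    total = fp16_bytes + optim_bytes
    if partition:
        # Partitioning: each GPU keeps only its 1/num_gpus shard,
        # which removes the redundant copies held by data-parallel ranks.
        total = total // num_gpus
    return total


if __name__ == "__main__":
    n = 1_000_000_000  # a 1-billion-parameter model, chosen for illustration
    for label, kwargs in [
        ("replicated", {}),
        ("partitioned over 4 GPUs", {"num_gpus": 4, "partition": True}),
        ("partitioned + optimizer offload",
         {"num_gpus": 4, "partition": True, "offload_optimizer": True}),
    ]:
        gib = model_state_bytes_per_gpu(n, **kwargs) / 2**30
        print(f"{label}: {gib:.1f} GiB per GPU")
```

Under these assumptions, each technique lowers the GPU footprint but adds communication or host-device transfer cost, which is the trade-off the abstract says must be tuned per configuration.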
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Layer Normalization · Byte Pair Encoding · Softmax · Linear Warmup With Cosine Annealing · Adam
