LLMTailor: A Layer-wise Tailoring Tool for Efficient Checkpointing of Large Language Models
Minqiu Sun, Xin Huang, Luanzheng Guo, Nathan R. Tallent, Kento Sato, Dong Dai

TL;DR
LLMTailor is a framework that reduces the storage and time overhead of checkpointing large language models by saving only the layers that have changed the most, without sacrificing model quality.
Contribution
It introduces layer-wise selective checkpointing: a usable checkpoint is assembled from layers drawn from different checkpoints, a capability not available in existing tools.
Findings
Reduces checkpoint size by up to 4.3 times
Speeds up checkpointing by up to 2.8 times
Maintains model quality despite selective checkpointing
Abstract
Checkpointing is essential for fault tolerance in training large language models (LLMs). However, existing methods, regardless of their I/O strategies, periodically store the entire model and optimizer states, incurring substantial storage overhead and resource contention. Recent studies reveal that updates across LLM layers are highly non-uniform. Across training steps, some layers may undergo more significant changes, while others remain relatively stable or even unchanged. This suggests that selectively checkpointing only layers with significant updates could reduce overhead without harming training. Implementing such selective strategies requires fine-grained control over both weights and optimizer states, which no current tool provides. To address this gap, we propose LLMTailor, a checkpoint-merging framework that filters and assembles layers from different checkpoints to…
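The idea in the abstract can be illustrated with a minimal Python sketch. This is not LLMTailor's implementation or API: the function names (`layer_of`, `relative_change`, `select_layers`, `partial_checkpoint`, `assemble`), the `layers.<index>` naming scheme, and the relative L2-norm change metric are all assumptions made for illustration. Weights are plain lists of floats so the example runs without any deep learning library, and optimizer states, which the abstract says must be handled as well, are left out.

```python
import math


def layer_of(param_name):
    """Group a parameter name such as 'layers.3.mlp.weight' under 'layers.3'.

    Assumed naming scheme; anything not under 'layers.' is its own group.
    """
    parts = param_name.split(".")
    if parts[0] == "layers" and len(parts) > 1:
        return ".".join(parts[:2])
    return parts[0]


def relative_change(old, new):
    """Per-layer L2 norm of the update divided by the L2 norm of the old weights.

    This metric is an assumption of the sketch, not taken from the paper.
    """
    diff_sq, base_sq = {}, {}
    for name, old_vals in old.items():
        layer = layer_of(name)
        new_vals = new[name]
        diff_sq[layer] = diff_sq.get(layer, 0.0) + sum(
            (n - o) ** 2 for o, n in zip(old_vals, new_vals)
        )
        base_sq[layer] = base_sq.get(layer, 0.0) + sum(o ** 2 for o in old_vals)
    # Fall back to a denominator of 1.0 for all-zero layers to avoid division by zero.
    return {
        layer: math.sqrt(diff_sq[layer]) / (math.sqrt(base_sq[layer]) or 1.0)
        for layer in diff_sq
    }


def select_layers(old, new, threshold):
    """Return the set of layers whose relative change exceeds the threshold."""
    changes = relative_change(old, new)
    return {layer for layer, change in changes.items() if change > threshold}


def partial_checkpoint(state, layers):
    """Keep only the parameters that belong to the selected layers."""
    return {name: vals for name, vals in state.items() if layer_of(name) in layers}


def assemble(base, partial):
    """Build a full checkpoint: layers from `partial` override those in `base`."""
    merged = dict(base)
    merged.update(partial)
    return merged


# Toy example: only 'layers.1' moved noticeably since the last full checkpoint.
last_full = {
    "embed.weight": [1.0, 2.0],
    "layers.0.weight": [1.0, 1.0],
    "layers.1.weight": [2.0, 0.0],
}
current = {
    "embed.weight": [1.0, 2.0],
    "layers.0.weight": [1.0, 1.001],
    "layers.1.weight": [2.5, 0.5],
}

selected = select_layers(last_full, current, threshold=0.01)
saved = partial_checkpoint(current, selected)   # what would be written to storage
restored = assemble(last_full, saved)           # what recovery would load
```

In this toy run the partial checkpoint holds one of three tensors, and the restored state takes `layers.1` from the newer checkpoint and every other layer from the older one. The layers that were skipped are slightly stale after recovery, which is the trade-off the abstract argues does not harm training when the skipped updates are small.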
Taxonomy
Topics: Distributed systems and fault tolerance · Scientific Computing and Data Management · Topic Modeling
