LLMTailor: A Layer-wise Tailoring Tool for Efficient Checkpointing of Large Language Models

Minqiu Sun; Xin Huang; Luanzheng Guo; Nathan R. Tallent; Kento Sato; Dong Dai

arXiv:2602.22158 · cs.DC · February 26, 2026

Open Access

TL;DR

LLMTailor is a framework that reduces the storage and time overhead of checkpointing large language models by saving only the layers with significant updates, without sacrificing model quality.

Contribution

It introduces a layer-wise selective checkpointing method that filters and assembles layers from different checkpoints, with fine-grained control over both weights and optimizer states, a capability not available in existing tools.

Findings

01  Reduces checkpoint size by up to 4.3 times

02  Speeds up checkpointing by up to 2.8 times

03  Maintains model quality despite selective checkpointing

Abstract

Checkpointing is essential for fault tolerance in training large language models (LLMs). However, existing methods, regardless of their I/O strategies, periodically store the entire model and optimizer states, incurring substantial storage overhead and resource contention. Recent studies reveal that updates across LLM layers are highly non-uniform. Across training steps, some layers may undergo more significant changes, while others remain relatively stable or even unchanged. This suggests that selectively checkpointing only layers with significant updates could reduce overhead without harming training. Implementing such selective strategies requires fine-grained control over both weights and optimizer states, which no current tool provides. To address this gap, we propose LLMTailor, a checkpoint-merging framework that filters and assembles layers from different checkpoints to…
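
The idea of saving only layers with significant updates and later assembling layers from different checkpoints can be illustrated with a minimal Python sketch. This is not LLMTailor's API or the authors' selection criterion: the function names, the `layers.<index>.` naming convention, the relative L2-norm metric, and the threshold are all assumptions made for illustration, and plain lists of floats stand in for tensors. A real implementation would apply the same per-layer filter to optimizer states as well.

```python
from math import sqrt


def layer_of(param_name):
    # Assumption: transformer block parameters are named "layers.<index>.<rest>";
    # any other parameter is grouped under its first name component.
    parts = param_name.split(".")
    if parts[0] == "layers" and len(parts) > 1:
        return "layers." + parts[1]
    return parts[0]


def relative_change(prev, curr):
    """Relative L2 change of each layer between two snapshots (hypothetical metric)."""
    diff_sq, base_sq = {}, {}
    for name, values in curr.items():
        layer = layer_of(name)
        old = prev[name]
        diff_sq[layer] = diff_sq.get(layer, 0.0) + sum((a - b) ** 2 for a, b in zip(values, old))
        base_sq[layer] = base_sq.get(layer, 0.0) + sum(b * b for b in old)
    # Fall back to a denominator of 1.0 when the previous layer norm is zero.
    return {layer: sqrt(diff_sq[layer]) / (sqrt(base_sq[layer]) or 1.0) for layer in diff_sq}


def select_layers(prev, curr, threshold):
    """Layers whose relative change exceeds the (assumed) threshold."""
    return {layer for layer, change in relative_change(prev, curr).items() if change > threshold}


def partial_checkpoint(state, layers):
    """Keep only the parameters that belong to the selected layers."""
    return {name: values for name, values in state.items() if layer_of(name) in layers}


def merge_checkpoints(base, partial):
    """Assemble a full state: layers from `partial` override those in `base`."""
    merged = dict(base)
    merged.update(partial)
    return merged


# Toy example: layer 0 is unchanged between steps, layer 1 moved noticeably.
prev = {"layers.0.weight": [1.0, 1.0], "layers.1.weight": [1.0, 1.0]}
curr = {"layers.0.weight": [1.0, 1.0], "layers.1.weight": [2.0, 1.0]}

changed = select_layers(prev, curr, threshold=0.1)
partial = partial_checkpoint(curr, changed)
restored = merge_checkpoints(prev, partial)
```

In this toy case only `layers.1` is written to the partial checkpoint, and merging it over the earlier full checkpoint reproduces the current state exactly because the skipped layer did not change. When a skipped layer has changed by less than the threshold, the merged state is only an approximation of the current one.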

Taxonomy

Topics: Distributed systems and fault tolerance · Scientific Computing and Data Management · Topic Modeling