Checkpoint Merging via Bayesian Optimization in LLM Pretraining

Deyuan Liu; Zecheng Wang; Bingning Wang; Weipeng Chen; Chunshan Li; Zhiying Tu; Dianhui Chu; Bo Li; Dianbo Sui

arXiv:2403.19390 · cs.CL · June 4, 2025 · 1 cites

TL;DR

This paper introduces a Bayesian optimization-based checkpoint merging method for large language model pretraining, aiming to reduce computational costs while maintaining robust performance across various domains.

Contribution

It presents a novel checkpoint merging technique utilizing Bayesian optimization to improve pretraining efficiency and generalization in large language models.

Findings

01. Enhances pretraining with minimal additional cost
02. Demonstrates robust cross-domain generalization
03. Achieves significant benefits through checkpoint merging

Abstract

The rapid proliferation of large language models (LLMs) such as GPT-4 and Gemini underscores the intense demand for resources during training, which poses significant challenges because of the substantial computational and environmental costs. To alleviate this issue, we propose checkpoint merging during LLM pretraining. The method uses LLM checkpoints that share a training trajectory and searches an extensive space for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) the proposed method can augment pretraining, offering substantial benefits at minimal cost; (2) although it requires a held-out dataset, the proposed method still generalizes robustly across diverse domains, a pivotal aspect of pretraining.
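The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes two checkpoints from the same run are merged by linear interpolation with a single weight, and that the weight is chosen by a small Gaussian-process search with an upper-confidence-bound acquisition over a grid. The function names, the RBF kernel, the acquisition rule, and the toy held-out loss are all hypothetical choices made for illustration.

```python
import numpy as np

def merge_checkpoints(ckpt_a, ckpt_b, lam):
    """Interpolate two parameter dicts: lam * a + (1 - lam) * b (assumed merge rule)."""
    return {k: lam * ckpt_a[k] + (1.0 - lam) * ckpt_b[k] for k in ckpt_a}

def rbf_kernel(x1, x2, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of merge weights."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP on x_grid."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_grid)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def search_merge_weight(ckpt_a, ckpt_b, heldout_loss, n_iter=10, kappa=2.0):
    """Search lam in [0, 1] that minimizes heldout_loss of the merged checkpoint."""
    grid = np.linspace(0.0, 1.0, 101)
    xs = [0.0, 1.0]  # start from the two unmerged checkpoints
    ys = [-heldout_loss(merge_checkpoints(ckpt_a, ckpt_b, x)) for x in xs]
    for _ in range(n_iter):
        y = np.array(ys)
        y_norm = (y - y.mean()) / (y.std() + 1e-12)  # standardize the objective
        mu, sigma = gp_posterior(np.array(xs), y_norm, grid)
        x_next = float(grid[np.argmax(mu + kappa * sigma)])  # UCB acquisition
        if x_next in xs:  # no new candidate worth evaluating
            break
        xs.append(x_next)
        ys.append(-heldout_loss(merge_checkpoints(ckpt_a, ckpt_b, x_next)))
    best = int(np.argmax(ys))
    return xs[best], -ys[best]

# Toy stand-ins: real checkpoints would be model state dicts, and the loss
# would be computed on the held-out dataset mentioned in the abstract.
ckpt_prev = {"w": np.zeros(2)}
ckpt_last = {"w": np.ones(2)}
toy_loss = lambda params: float(np.sum((params["w"] - 0.3) ** 2))

best_lam, best_loss = search_merge_weight(ckpt_last, ckpt_prev, toy_loss)
print(best_lam, best_loss)
```

Because the two unmerged checkpoints are evaluated first, the returned weight is never worse on the held-out loss than either endpoint. Each extra evaluation costs one forward pass over the held-out set, which is small next to continued pretraining.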

Taxonomy

Topics: Manufacturing Process and Optimization · Scheduling and Optimization Algorithms

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization · Byte Pair Encoding · Softmax · Dropout · Multi-Head Attention