A Learning Rate Path Switching Training Paradigm for Version Updates of Large Language Models
Zhihao Wang, Shiyu Liu, Jianheng Huang, Zheng Wang, Yixuan Liao, Xiaoxin Chen, Junfeng Yao, Jinsong Su

TL;DR
This paper introduces a learning rate path switching training paradigm for LLM version updates. By adjusting the learning rate schedule across updates, it cuts total training cost to 58% of pre-training from scratch (PTFS) while maintaining comparable pre-training performance.
Contribution
It proposes a training paradigm that switches learning rate paths during LLM version updates, aiming for a better trade-off between training cost and pre-training performance than pre-training from scratch (PTFS) or continual pre-training (CPT).
Findings
Reduces total training cost to 58% of PTFS.
Maintains comparable pre-training performance.
Effective and generalizable across multiple LLM versions.
Abstract
Due to the continuous emergence of new data, version updates have become an indispensable requirement for Large Language Models (LLMs). The training paradigms for version updates of LLMs include pre-training from scratch (PTFS) and continual pre-training (CPT). Preliminary experiments demonstrate that PTFS achieves better pre-training performance, while CPT has lower training cost. Moreover, their performance and training cost gaps widen progressively with version updates. To investigate the underlying reasons for this phenomenon, we analyze the effect of learning rate adjustments during the two stages of CPT: preparing an initialization checkpoint and continual pre-training based on this checkpoint. We find that a large learning rate in the first stage and a complete learning rate decay process in the second stage are crucial for version updates of LLMs. Hence, we propose a learning…
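The abstract's key finding (a large learning rate while preparing the initialization checkpoint, then a complete learning rate decay during continual pre-training) can be illustrated with a small schedule sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the abstract is truncated before the proposed paradigm is described, and the function names, the cosine decay shape, the warmup, and all numeric values below are hypothetical choices made for the example.

```python
import math

# All names and default values here are illustrative assumptions,
# not taken from the paper.

def main_path_lr(step, max_lr=3e-4, warmup_steps=100):
    """Stage 1 (preparing the initialization checkpoint):
    linear warmup, then hold the large learning rate constant."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return max_lr

def branch_path_lr(step_in_branch, branch_steps, max_lr=3e-4, min_lr=3e-5):
    """Stage 2 (continual pre-training from that checkpoint):
    one complete cosine decay from max_lr down to min_lr."""
    progress = min(step_in_branch / max(branch_steps - 1, 1), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def version_update_schedule(main_steps, branch_steps):
    """Concatenate the two stages into one per-step learning rate list."""
    main = [main_path_lr(s) for s in range(main_steps)]
    branch = [branch_path_lr(s, branch_steps) for s in range(branch_steps)]
    return main + branch

schedule = version_update_schedule(main_steps=200, branch_steps=50)
print(f"end of stage 1: {schedule[199]:.2e}, end of stage 2: {schedule[-1]:.2e}")
```

In this sketch the checkpoint saved at the end of stage 1 still sits at the large learning rate, so a later version update could start its own full decay from the same point rather than from an already-decayed checkpoint.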
Taxonomy
Topics: Topic Modeling
