Decoupled Relative Learning Rate Schedules

Jan Ludziejewski; Jan Małaśnicki; Maciej Pióro; Michał Krutul; Kamil Ciebiera; Maciej Stefaniak; Jakub Krajewski; Piotr Sankowski; Marek Cygan; Kamil Adamczewski; Sebastian Jaszczur

arXiv:2507.03526 · cs.LG · July 8, 2025

TL;DR

This paper presents RLRS, a method that assigns relative learning rates to the different components of Transformer models. It speeds up training by up to 23%, particularly for complex models such as Mixture of Experts (MoE), and its hyperparameters can be tuned on small models and reused on models up to 27 times larger.

Contribution

Introduction of RLRS, a relative learning rate schedule that improves training efficiency and scalability for large Transformer-based models.

Findings

1. Accelerates training by up to 23%, particularly in complex models such as Mixture of Experts (MoE).

2. Hyperparameters tuned on small models transfer to models up to 27 times larger.

3. Substantially reduces training time and computational resources.

Abstract

In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across the weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Our relative learning rates method, RLRS, accelerates training by up to $23\%$, particularly in complex models such as Mixture of Experts (MoE). The hyperparameters of RLRS can be tuned efficiently on smaller models and then reused on models up to $27\times$ larger. This simple and effective method substantially reduces training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
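
To make the idea of component-wise learning rates concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract does not specify the schedule, so the cosine base schedule, the constant per-component multipliers, the component names, and the helper names `base_lr` and `relative_lrs` are all assumptions chosen for illustration.

```python
import math


def base_lr(step, total_steps, peak_lr, final_fraction=0.1):
    """Assumed base schedule: cosine decay from peak_lr to final_fraction * peak_lr."""
    progress = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_fraction + (1.0 - final_fraction) * cosine)


def relative_lrs(step, total_steps, peak_lr, multipliers):
    """Scale the shared base learning rate by one relative multiplier per component."""
    lr = base_lr(step, total_steps, peak_lr)
    return {name: lr * m for name, m in multipliers.items()}


# Hypothetical component names and multiplier values, for illustration only;
# they are not taken from the paper.
MULTIPLIERS = {"embedding": 2.0, "attention": 1.0, "feed_forward": 0.5}

start = relative_lrs(step=0, total_steps=1000, peak_lr=1e-3, multipliers=MULTIPLIERS)
end = relative_lrs(step=1000, total_steps=1000, peak_lr=1e-3, multipliers=MULTIPLIERS)

for name in MULTIPLIERS:
    print(f"{name}: start={start[name]:.2e}, end={end[name]:.2e}")
```

Under these assumptions, only the small set of multipliers needs tuning, and the base learning rate can be chosen separately. In a training framework, such per-component rates would typically be applied through per-group optimizer settings rather than computed by hand.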
