Decoupled Relative Learning Rate Schedules
Jan Ludziejewski, Jan Małaśnicki, Maciej Pióro, Michał Krutul, Kamil Ciebiera, Maciej Stefaniak, Jakub Krajewski, Piotr Sankowski, Marek Cygan, Kamil Adamczewski, Sebastian Jaszczur

TL;DR
This paper presents RLRS, a novel method for adjusting learning rates across different parts of Transformer models, significantly speeding up training and reducing computational costs, especially for large-scale models.
Contribution
Introduction of RLRS, a relative learning rate schedule that improves training efficiency and scalability for large Transformer-based models.
Findings
Accelerates training by up to 23%, particularly in complex models such as Mixture of Experts (MoE).
Hyperparameters tuned on small models transfer well to larger ones.
Reduces training time and computational resources significantly.
Abstract
In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across the weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Our method, relative learning rate schedules (RLRS), accelerates training by up to 23%, particularly in complex models such as Mixture of Experts (MoE). The hyperparameters of RLRS can be tuned efficiently on smaller models and then reused effectively on larger models. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
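The idea of giving each model component its own learning rate relative to a shared schedule can be illustrated with a minimal sketch. This is not the paper's exact formulation: the component names, the multiplier values, the warmup length, and the cosine decay shape below are all assumptions chosen only for illustration.

```python
import math

# Hypothetical relative multipliers per component (illustrative values only,
# not taken from the paper).
RELATIVE_LR = {
    "embedding": 2.0,
    "attention": 1.0,
    "feed_forward": 1.0,
    "router": 0.5,  # e.g. the expert router in an MoE layer
}


def base_lr(step, total_steps, peak_lr=1e-3, warmup_steps=100, final_fraction=0.1):
    """Shared schedule (assumed): linear warmup, then cosine decay
    down to final_fraction * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_fraction + (1.0 - final_fraction) * cosine)


def component_lrs(step, total_steps):
    """Scale the shared learning rate by each component's relative multiplier."""
    shared = base_lr(step, total_steps)
    return {name: shared * scale for name, scale in RELATIVE_LR.items()}


if __name__ == "__main__":
    for step in (0, 100, 550, 1000):
        print(step, component_lrs(step, total_steps=1000))
```

In a training loop, the resulting dictionary could be written into one optimizer parameter group per component at every step. Because the multipliers are expressed relative to the shared schedule rather than as absolute values, they are the kind of quantity one might tune on a small model and carry over to a larger one.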
