GQA-μP: The maximal parameterization update for grouped query attention
Kyle R. Chickering, Huijuan Wang, Mengxi Wu, Alexander Moreno, Muhao Chen, Xuezhe Ma, Daria Soboleva, Joel Hestness, Zhengzhong Liu, Eric Xing

TL;DR
This paper extends the maximal update parameterization (μP) to grouped-query attention (GQA) by deriving new scalings, enabling hyperparameter transfer across model architectures.
Contribution
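It derives μP scalings for GQA using a modified spectral norm that preserves the valid scaling of network weights when weight matrices are not full rank, and it treats spectral norm conditions on the weights as the definition of feature learning rather than a heuristic.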
Findings
Successful transfer of learning rate hyperparameters across GQA models.
Effective transfer of weight decay hyperparameters.
To the authors' knowledge, the first derivation of μP scalings for grouped-query attention.
Abstract
Hyperparameter transfer across model architectures dramatically reduces the amount of compute necessary for tuning large language models (LLMs). The maximal update parameterization (μP) ensures transfer through principled mathematical analysis but can be challenging to derive for new model architectures. Building on the spectral feature-learning view of Yang et al. (2023a), we make two advances. First, we promote spectral norm conditions on the weights from a heuristic to the definition of feature learning, and as a consequence arrive at the Complete-P depth and weight-decay scalings without recourse to lazy-learning. Second, we consider a modified spectral norm that preserves the valid scaling law of network weights when weight matrices are not full rank. This enables (to our knowledge, the first) derivation of μP scalings for grouped-query attention (GQA). We demonstrate the…
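For context, the spectral condition from the spectral feature-learning view that the abstract builds on can be sketched as below. The notation is assumed for illustration and may differ from the paper's own: W_ℓ is the weight matrix of layer ℓ with fan-in n_{ℓ-1} and fan-out n_ℓ, ΔW_ℓ is its update after one optimizer step, and ‖·‖_* is the spectral norm. The modified norm for weight matrices that are not full rank is not reproduced here, since the abstract does not state its form.

```latex
% Spectral condition for feature learning (standard full-rank setting).
% Symbols W_\ell, \Delta W_\ell, n_\ell are illustrative notation, not taken from this page.
\[
  \|W_\ell\|_* = \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right),
  \qquad
  \|\Delta W_\ell\|_* = \Theta\!\left(\sqrt{\frac{n_\ell}{n_{\ell-1}}}\right).
\]
```

Under this condition, both the weights and their updates change hidden features by an amount that stays order one per coordinate as width grows, which is the property μP-style scalings are designed to maintain.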
