When and Why Grouping Attention Heads Accelerates Muon Optimization
Hongtao Zhang, Wenjie Zhou, Wei Chen, Xueqi Cheng

TL;DR
This paper studies how the granularity at which Muon is applied to attention projections (the full matrix, individual heads, or intermediate head groups) affects optimization, and proposes Group Muon, which treats head group size and grouping rule as optimizer hyperparameters.
Contribution
It introduces Group Muon, which treats head group size and grouping rule as optimizer hyperparameters, and analyzes the underlying trade-off through a one-step descent comparison between full-matrix Muon and group-wise Muon.
Findings
Group-wise updates offer a whitening gain but incur a norm cost.
Appropriate grouping improves validation loss on GPT-2 Small trained on FineWeb.
With appropriate grouping, Group Muon outperforms both full-QKV Muon and fully head-wise MuonSplit.
Abstract
Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention projection, to individual heads, or to intermediate head groups. We study this question through a one-step descent comparison between full-matrix Muon and group-wise Muon. Our analysis reveals a trade-off between the group-wise whitening gain from group-wise updates and the grouping-induced norm cost, an additional update-norm cost caused by replacing full-matrix whitening with group-wise whitening. Motivated by this trade-off, we propose Group Muon, which treats head group size and grouping rule as optimizer hyperparameters. On GPT-2 Small trained on FineWeb, appropriate grouping improves validation loss over both full-QKV Muon and fully head-wise MuonSplit.
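To make the three granularities concrete, the following is a minimal sketch of group-wise orthogonalization, not the authors' implementation. It rests on several assumptions that are not stated in the abstract: heads are laid out as consecutive row blocks of a single projection gradient, groups are formed from consecutive heads (one possible grouping rule), and the exact SVD polar factor stands in for the approximate Newton-Schulz iteration that Muon implementations typically use. The names `orthogonalize` and `group_orthogonalize` are hypothetical.

```python
import numpy as np


def orthogonalize(m):
    """Return the polar factor U @ Vt of m.

    Assumption: exact SVD is used here as a stand-in for the
    approximate Newton-Schulz iteration used in practice.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def group_orthogonalize(grad, num_heads, group_size):
    """Orthogonalize each block of `group_size` consecutive heads separately.

    grad: array of shape (num_heads * head_dim, d_model), with heads
          stacked along the rows (an assumed layout).
    group_size == num_heads -> full-matrix orthogonalization
    group_size == 1         -> head-wise orthogonalization
    """
    rows, _ = grad.shape
    if rows % num_heads != 0:
        raise ValueError("row count must be divisible by num_heads")
    if num_heads % group_size != 0:
        raise ValueError("num_heads must be divisible by group_size")
    head_dim = rows // num_heads
    block = group_size * head_dim  # rows per head group
    blocks = [orthogonalize(grad[i:i + block]) for i in range(0, rows, block)]
    return np.concatenate(blocks, axis=0)


# Illustrative shapes only: 8 heads of dimension 4, model width 48.
rng = np.random.default_rng(0)
grad = rng.standard_normal((32, 48))
full = group_orthogonalize(grad, num_heads=8, group_size=8)      # full matrix
grouped = group_orthogonalize(grad, num_heads=8, group_size=2)   # 4 groups
headwise = group_orthogonalize(grad, num_heads=8, group_size=1)  # per head
```

In this sketch, `group_size` interpolates between the two extremes named in the abstract: each group block is whitened on its own, so rows within a group become orthonormal while rows from different groups are left free to correlate.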
