When and Why Grouping Attention Heads Accelerates Muon Optimization

Hongtao Zhang, Wenjie Zhou, Wei Chen, Xueqi Cheng

arXiv:2605.08933 · cs.LG · May 12, 2026

TL;DR

This paper asks whether Muon should orthogonalize the full attention projection, individual heads, or intermediate head groups, and proposes Group Muon, which treats head group size and grouping rule as optimizer hyperparameters.

Contribution

It introduces Group Muon, which treats head group size and grouping rule as optimizer hyperparameters, motivated by a one-step descent comparison between full-matrix Muon and group-wise Muon.

Findings

01

Relative to full-matrix Muon, group-wise updates offer a group-wise whitening gain but incur a grouping-induced norm cost.

02

With an appropriate group size and grouping rule, validation loss improves on GPT-2 Small trained on FineWeb.

03

Appropriately grouped Group Muon reaches lower validation loss than both full-QKV Muon and fully head-wise MuonSplit.

Abstract

Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention projection, to individual heads, or to intermediate head groups. We study this question through a one-step descent comparison between full-matrix Muon and group-wise Muon. Our analysis reveals a trade-off between the group-wise whitening gain from group-wise updates and the grouping-induced norm cost, an additional update-norm cost caused by replacing full-matrix whitening with group-wise whitening. Motivated by this trade-off, we propose Group Muon, which treats head group size and grouping rule as optimizer hyperparameters. On GPT-2 Small trained on FineWeb, appropriate grouping improves validation loss over both full-QKV Muon and fully head-wise MuonSplit.
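To make the granularity question concrete, here is a minimal NumPy sketch of orthogonalizing a gradient at different head-group sizes. It is an illustration under stated assumptions, not the authors' implementation: heads are assumed to occupy contiguous row blocks of the projection gradient, groups are formed from consecutive heads, and an exact SVD polar factor stands in for the Newton-Schulz iteration that Muon implementations typically use. The function names `orthogonalize` and `group_orthogonalize` are hypothetical. The two printed quantities (the inner product between gradient and update, and the spectral norm of the assembled update) are rough proxies for the gain and cost sides of the trade-off and may not match the paper's formal definitions.

```python
import numpy as np


def orthogonalize(g):
    """Return the polar factor U @ Vt of g via a reduced SVD.

    Assumption: exact SVD replaces the approximate Newton-Schulz
    iteration normally used in Muon.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def group_orthogonalize(g, num_heads, group_size):
    """Orthogonalize each block of `group_size` consecutive heads separately.

    Assumption: heads are laid out as contiguous row blocks of g.
    group_size == num_heads recovers full-matrix orthogonalization;
    group_size == 1 is fully head-wise.
    """
    if g.shape[0] % num_heads != 0:
        raise ValueError("row count must be divisible by num_heads")
    if num_heads % group_size != 0:
        raise ValueError("num_heads must be divisible by group_size")
    rows_per_group = (g.shape[0] // num_heads) * group_size
    blocks = g.reshape(num_heads // group_size, rows_per_group, g.shape[1])
    return np.concatenate([orthogonalize(b) for b in blocks], axis=0)


# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_heads, head_dim, d_model = 4, 8, 32
grad = rng.standard_normal((num_heads * head_dim, d_model))

for group_size in (4, 2, 1):
    update = group_orthogonalize(grad, num_heads, group_size)
    alignment = float(np.sum(grad * update))         # <grad, update>
    spectral = float(np.linalg.norm(update, 2))      # largest singular value
    print(f"group_size={group_size}  alignment={alignment:.3f}  "
          f"spectral_norm={spectral:.3f}")
```

In this sketch the alignment equals the sum of the blocks' nuclear norms, which by the triangle inequality is never smaller than the nuclear norm of the full matrix. The assembled update, however, is no longer guaranteed to have spectral norm 1; with k groups it can grow up to sqrt(k). Sweeping `group_size` exposes both effects on a toy gradient.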
