When and Why Grouping Attention Heads Accelerates Muon Optimization