Spectra: Rethinking Optimizers for LLMs Under Spectral Anisotropy
Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Robert P. Dick, Yuan Cheng, Fan Yang, Tun Lu, Li Shang

TL;DR
This paper introduces Spectra, a spectral-aware optimizer designed to address anisotropic gradient signals in LLM training, leading to faster convergence, reduced memory usage, and improved accuracy.
Contribution
Spectra is a novel optimizer that suppresses dominant spectral directions in LLM training without amplifying noise, improving efficiency and performance.
Findings
Reaches target loss 30% faster than AdamW.
Reduces optimizer-state memory by 49.25%.
Achieves 1.62% higher downstream accuracy.
Abstract
Gradient signals in LLM training are highly anisotropic: recurrent linguistic structure concentrates energy into a small set of dominant spectral directions, while context-specific information resides in a long tail. We show that this spike-tail separation persists throughout training, with the spike occupying only about 1.5% of directions yet dominating optimizer statistics. This dominance suppresses tail learning by contracting tail updates through second-moment normalization and tightening the globally stable learning-rate bound. Motivated by this analysis, we propose Spectra, a spike-aware optimizer that suppresses the dominant low-rank spike subspace without amplifying the noise-sensitive spectral tail. Spectra tracks the spike subspace via cached, warm-started power iteration and applies low-rank spectral shaping with negligible overhead and substantially reduced optimizer state…
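The two mechanisms named in the abstract, warm-started power iteration and low-rank spectral shaping, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give Spectra's actual shaping rule, so the function names `track_spike_subspace` and `shape_update`, the damping factor `damp`, the synthetic rank-2 gradient, and all sizes are assumptions chosen only to show the general idea. The cached basis from the previous step seeds the next block power step, and only the component of the gradient inside that subspace is scaled down, so the tail is left unamplified.

```python
import numpy as np

def track_spike_subspace(grad, basis, n_iter=1):
    """Refine an orthonormal basis (d_out x k) for the top-k left singular
    subspace of `grad`, warm-started from the cached `basis`."""
    for _ in range(n_iter):
        # One block power step on G G^T, then re-orthonormalize with QR.
        basis, _ = np.linalg.qr(grad @ (grad.T @ basis))
    return basis

def shape_update(grad, basis, damp=0.1):
    """Scale the spike component of `grad` by `damp` (an assumed rule);
    the component orthogonal to `basis` (the tail) passes through unchanged."""
    spike_part = basis @ (basis.T @ grad)
    return grad - (1.0 - damp) * spike_part

# Synthetic gradients (assumption): a fixed rank-2 spike plus small tail noise.
rng = np.random.default_rng(0)
d_out, d_in, k = 64, 48, 2
U, _ = np.linalg.qr(rng.standard_normal((d_out, k)))
V, _ = np.linalg.qr(rng.standard_normal((d_in, k)))
spike = U @ np.diag([50.0, 30.0]) @ V.T

# Cold start once; afterwards every step reuses the cached basis.
basis, _ = np.linalg.qr(rng.standard_normal((d_out, k)))
for step in range(20):
    grad = spike + 0.1 * rng.standard_normal((d_out, d_in))
    basis = track_spike_subspace(grad, basis)
    shaped = shape_update(grad, basis)

# Cosines of the principal angles between the true and tracked subspaces.
alignment = np.linalg.svd(U.T @ basis, compute_uv=False)
print("smallest alignment cosine:", alignment.min())
print("spectral norm before/after shaping:",
      np.linalg.norm(grad, 2), np.linalg.norm(shaped, 2))
```

Because the spike subspace drifts slowly between steps, a single power step per update is enough in this toy setting to keep the cached basis aligned. The only extra state is a `d_out x k` basis, which is small when `k` is a tiny fraction of the directions.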
Taxonomy
Topics: Muon and positron interactions and applications · Parallel Computing and Optimization Techniques · Particle Detector Development and Performance
