A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay
JiangBo Zhao, ZhaoXin Liu

TL;DR
MetaAdamW is a novel optimizer that uses self-attention to adapt learning rates and weight decay per parameter group, leading to improved training efficiency and performance across diverse tasks.
Contribution
The paper introduces a self-attentive meta-optimizer with task-specific uncertainty weighting, addressing the limitation that adaptive optimizers such as AdamW apply the same hyperparameters to every parameter group.
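As a rough illustration of the uncertainty-weighting idea, the sketch below combines several meta-objective terms using homoscedastic uncertainty weighting in its common log-variance form, with a per-term priority multiplying the regularization term. The exact formulation used by MetaAdamW is not given in this summary, so the function names (`huw_loss`, `optimal_log_var`), the placement of the priority, and the example numbers are all assumptions made for illustration.

```python
import math

def huw_loss(losses, log_vars, priorities=None):
    """Combine loss terms with learned log-variances s_i = log(sigma_i^2).

    Each term contributes 0.5 * exp(-s_i) * L_i + p_i * 0.5 * s_i.
    The priority p_i scaling the regularizer is an ASSUMED reading of
    "priorities that directly scale the regularization terms".
    """
    if priorities is None:
        priorities = [1.0] * len(losses)  # plain uncertainty weighting
    total = 0.0
    for loss, s, p in zip(losses, log_vars, priorities):
        total += 0.5 * math.exp(-s) * loss + p * 0.5 * s
    return total

def optimal_log_var(loss, priority=1.0):
    """Log-variance that minimizes one term for a fixed positive loss."""
    # Setting the derivative -0.5*exp(-s)*L + 0.5*p to zero gives s = log(L / p).
    return math.log(loss / priority)

# Hypothetical values for the three terms named in the abstract:
# gradient alignment, loss decrease, generalization gap.
terms = [2.0, 4.0, 1.0]
combined = huw_loss(terms, [0.0, 0.0, 0.0], priorities=[1.0, 2.0, 0.5])

# Effective weight 0.5 * exp(-s*) = 0.5 * p / L, so it grows with priority.
low = 0.5 * math.exp(-optimal_log_var(4.0, priority=1.0))
high = 0.5 * math.exp(-optimal_log_var(4.0, priority=2.0))
```

Under this assumed form, a higher priority penalizes a large log-variance more strongly, so the term settles at a larger effective weight. That is one way domain knowledge could steer an otherwise automatic balance.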
Findings
MetaAdamW outperforms AdamW on five diverse tasks.
It reduces training time by up to 17.11%.
It improves performance metrics by up to 11.08%.
Abstract
Adaptive optimizers like AdamW apply uniform hyperparameters across all parameter groups, ignoring heterogeneous optimization dynamics across layers and modules. We address this limitation by proposing MetaAdamW, a new optimizer that integrates a self-attention mechanism to dynamically modulate per-group learning rates and weight decay. The modulation factors are produced by a lightweight Transformer encoder that operates on statistical features (gradient norms, momentum norms, correlations) extracted from each parameter group. To train the attention module, we introduce a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap. A key novel contribution is the extension of homoscedastic uncertainty weighting (HUW) with task-specific priorities that directly scale the regularization terms, enabling domain knowledge to guide automatic loss…
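The per-group pipeline described in the abstract can be sketched in plain Python: extract a small feature vector per parameter group, let the groups attend to one another, and turn the result into bounded multipliers for learning rate and weight decay. This is a minimal sketch under stated assumptions, not the paper's implementation. The untrained identity-projection attention stands in for the Transformer encoder, and the `bounded_factor` readout, the group names, and all numbers are hypothetical.

```python
import math

def group_features(grads, momenta):
    """Return [grad norm, momentum norm, cosine(grad, momentum)] for one group."""
    g_norm = math.sqrt(sum(g * g for g in grads))
    m_norm = math.sqrt(sum(m * m for m in momenta))
    dot = sum(g * m for g, m in zip(grads, momenta))
    denom = g_norm * m_norm
    return [g_norm, m_norm, dot / denom if denom > 0.0 else 0.0]

def self_attention(rows):
    """Single-head scaled dot-product attention with identity Q/K/V (untrained)."""
    d = len(rows[0])
    out = []
    for q in rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rows]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, rows))
                    for j in range(d)])
    return out

def bounded_factor(x):
    """Hypothetical readout: squash a scalar into the range (0.5, 1.5)."""
    return 1.0 + 0.5 * math.tanh(x)

# Toy (gradient, momentum) vectors for two hypothetical parameter groups.
groups = {
    "embedding": ([3.0, 4.0], [3.0, 4.0]),
    "head": ([1.0, 0.0], [0.0, 2.0]),
}
names = list(groups)
features = {n: group_features(*groups[n]) for n in names}
attended = self_attention([features[n] for n in names])

base_lr, base_wd = 1e-3, 1e-2
per_group = {}
for name, row in zip(names, attended):
    # ASSUMPTION: the attended gradient/momentum cosine drives both factors.
    lr_factor = bounded_factor(row[2])
    wd_factor = bounded_factor(-row[2])
    per_group[name] = (base_lr * lr_factor, base_wd * wd_factor)
```

Bounding the factors keeps each group's learning rate and weight decay within a fixed multiple of the base values, which is one plausible way to stop an untrained or poorly trained attention module from destabilizing the base optimizer.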
