A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay

JiangBo Zhao; ZhaoXin Liu

arXiv:2605.04055 · cs.LG · May 7, 2026

TL;DR

MetaAdamW is a novel optimizer that uses self-attention to adapt learning rates and weight decay per parameter group, leading to improved training efficiency and performance across diverse tasks.

Contribution

The paper introduces a self-attentive meta-optimizer with task-specific uncertainty weighting, addressing the limitation that adaptive optimizers apply the same hyperparameters to every parameter group.
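As a rough illustration of uncertainty weighting with priorities, the sketch below starts from the standard homoscedastic form, where each loss term L_i is weighted by 0.5 * exp(-s_i) and regularized by 0.5 * s_i, with s_i = log sigma_i^2 a learnable scalar. The priority factor p_i multiplying the regularizer is an assumed reading of "priorities that directly scale the regularization terms", not the paper's confirmed formula; the function name `huw_loss` and the example numbers are hypothetical.

```python
import math

def huw_loss(losses, log_vars, priorities=None):
    """Combine loss terms with homoscedastic uncertainty weighting.

    Each term contributes 0.5 * exp(-s_i) * L_i + p_i * 0.5 * s_i, where
    s_i = log(sigma_i^2). The priority p_i on the regularizer is an
    ASSUMPTION for illustration; p_i = 1 recovers the standard form.
    """
    if priorities is None:
        priorities = [1.0] * len(losses)
    total = 0.0
    for loss, s, p in zip(losses, log_vars, priorities):
        total += 0.5 * math.exp(-s) * loss + p * 0.5 * s
    return total

# Hypothetical values for three meta-objective terms:
# gradient alignment, loss decrease, generalization gap.
terms = [0.8, 1.5, 0.3]
log_vars = [0.0, 0.2, -0.1]
priorities = [1.0, 2.0, 0.5]
combined = huw_loss(terms, log_vars, priorities)
```

Under this assumed form, minimizing over s_i for a positive loss gives exp(-s_i) = p_i / L_i, so a larger priority pushes the learned weight on that term upward.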

Findings

01. MetaAdamW outperforms AdamW on five diverse tasks.

02. It reduces training time by up to 17.11%.

03. It improves performance metrics by up to 11.08%.

Abstract

Adaptive optimizers like AdamW apply uniform hyperparameters across all parameter groups, ignoring heterogeneous optimization dynamics across layers and modules. We address this limitation by proposing MetaAdamW - a new optimizer that integrates a self-attention mechanism to dynamically modulate per-group learning rates and weight decay. The modulation factors are produced by a lightweight Transformer encoder that operates on statistical features (gradient norms, momentum norms, correlations) extracted from each parameter group. To train the attention module, we introduce a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap. A key novel contribution is the extension of homoscedastic uncertainty weighting (HUW) with task-specific priorities that directly scale the regularization terms - enabling domain knowledge to guide automatic loss…
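The per-group modulation described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the single-head attention with identity projections stands in for the learned Transformer encoder, and the tanh bounding, the `span` parameter, the head weights, and all function names are assumptions made for the example.

```python
import math

def group_features(grad, momentum):
    """Features for one parameter group: gradient norm, momentum norm,
    and the cosine correlation between gradient and momentum."""
    g_norm = math.sqrt(sum(g * g for g in grad))
    m_norm = math.sqrt(sum(m * m for m in momentum))
    dot = sum(g * m for g, m in zip(grad, momentum))
    cos = dot / (g_norm * m_norm + 1e-12)
    return [g_norm, m_norm, cos]

def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(feats):
    """Single-head scaled dot-product self-attention across groups.
    Identity projections are an ASSUMED stand-in for learned weights."""
    scale = math.sqrt(len(feats[0]))
    mixed = []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in feats]
        weights = softmax(scores)
        mixed.append([sum(w * v[d] for w, v in zip(weights, feats))
                      for d in range(len(q))])
    return mixed

def modulation_factors(feats, lr_head, wd_head, span=0.5):
    """Map attended features to (lr_scale, wd_scale) per group.
    tanh keeps each factor inside (1 - span, 1 + span); the bounding
    scheme is an assumption, not taken from the paper."""
    factors = []
    for h in self_attention(feats):
        lr_scale = 1.0 + span * math.tanh(sum(w * x for w, x in zip(lr_head, h)))
        wd_scale = 1.0 + span * math.tanh(sum(w * x for w, x in zip(wd_head, h)))
        factors.append((lr_scale, wd_scale))
    return factors

def modulate(base_lr, base_wd, factors):
    """Apply the factors to shared base hyperparameters."""
    return [(base_lr * a, base_wd * b) for a, b in factors]

# Hypothetical groups: name -> (flattened gradient, flattened momentum).
groups = {
    "embedding": ([0.1, -0.2, 0.05], [0.08, -0.15, 0.02]),
    "attention": ([1.5, 0.7, -0.9], [-0.3, 0.4, 0.1]),
}
feats = [group_features(g, m) for g, m in groups.values()]
factors = modulation_factors(feats, lr_head=[0.1, -0.1, 0.5],
                             wd_head=[-0.2, 0.1, 0.0])
for name, (lr, wd) in zip(groups, modulate(1e-3, 1e-2, factors)):
    print(f"{name}: lr={lr:.6f} wd={wd:.6f}")
```

Because attention mixes features across groups, each group's factors depend on the statistics of every other group, which is the property that distinguishes this scheme from tuning each group in isolation. In a real training loop the head weights would be trained with the meta-learning objective rather than fixed by hand.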
