$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Th\'erien; Charles-\'Etienne Joseph; Boris Knyazev; Edouard Oyallon; Irina Rish; Eugene Belilovsky

arXiv:2406.00153·cs.LG·March 20, 2026

TL;DR

This paper introduces $\mu$LO, a meta-training approach for learned optimizers that improves their ability to generalize to wider, deeper, and longer training tasks at the same meta-training compute budget.

Contribution

The paper derives the Maximal Update Parametrization for learned optimizers and proposes a simple meta-training recipe that improves their meta-generalization capabilities.
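
As a rough illustration of what a Maximal Update Parametrization prescribes, the sketch below computes per-layer scaling factors for an MLP using the commonly cited Adam-style $\mu$P rules (initialization variance, forward multiplier, and update multiplier per layer type). It is an assumption that these rules carry over to learned optimizers in exactly this form; the paper's own derivation may differ. The names `LayerScaling` and `mup_layer_scalings` are hypothetical and do not come from the paper or its repository.

```python
import math
from dataclasses import dataclass


@dataclass
class LayerScaling:
    kind: str            # "input", "hidden", or "output"
    fan_in: int
    fan_out: int
    init_std: float      # standard deviation for weight initialization
    forward_mult: float  # multiplier applied to the layer's pre-activations
    update_mult: float   # multiplier applied to the optimizer's update


def mup_layer_scalings(widths):
    """Return muP-style scalings for an MLP with layer sizes `widths`.

    `widths` is [d_in, h_1, ..., h_k, d_out]. The rules used here are the
    Adam-style muP rules (an assumption, not taken from the paper):
      input  layer: init var 1/fan_in, forward mult 1,        update mult 1
      hidden layer: init var 1/fan_in, forward mult 1,        update mult 1/fan_in
      output layer: init var 1,        forward mult 1/fan_in, update mult 1
    """
    n_layers = len(widths) - 1
    if n_layers < 2:
        raise ValueError("need at least an input layer and an output layer")

    scalings = []
    for i in range(n_layers):
        fan_in, fan_out = widths[i], widths[i + 1]
        if i == 0:
            s = LayerScaling("input", fan_in, fan_out,
                             1.0 / math.sqrt(fan_in), 1.0, 1.0)
        elif i == n_layers - 1:
            s = LayerScaling("output", fan_in, fan_out,
                             1.0, 1.0 / fan_in, 1.0)
        else:
            s = LayerScaling("hidden", fan_in, fan_out,
                             1.0 / math.sqrt(fan_in), 1.0, 1.0 / fan_in)
        scalings.append(s)
    return scalings


if __name__ == "__main__":
    # Compare a narrow and a wide network: only width-dependent factors change.
    for width in (128, 1024):
        print(f"hidden width = {width}")
        for s in mup_layer_scalings([32, width, width, 10]):
            print(f"  {s.kind:6s} init_std={s.init_std:.4f} "
                  f"forward_mult={s.forward_mult:.6f} update_mult={s.update_mult:.6f}")
```

Under these assumed rules, widening the hidden layers shrinks the hidden-layer update multiplier and the output-layer forward multiplier in proportion to `1/fan_in`, which is the mechanism intended to keep activation and update magnitudes comparable across widths.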

Findings

01

Meta-trained $\mu$LOs outperform standard parametrization LOs on unseen wider tasks.

02

$\mu$LOs show improved generalization to deeper networks.

03

$\mu$LOs show enhanced generalization to longer training horizons.

Abstract

Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($\mu$P) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for $\mu$-parameterized LOs ($\mu$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also empirically observe that $\mu$LOs exhibit unexpectedly improved meta-generalization to deeper networks ($5\times$ meta-training) and surprising generalization to much longer…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. The motivation is interesting and well-grounded. The idea of viewing optimizers as hyperparameters that can generalize across $\mu$P-initialized networks of varying widths is both novel and conceptually appealing.
2. The learned optimizers show encouraging results in zero-shot-like transfer scenarios.

Weaknesses

1. Figure 5 indicates meta-overfitting: the trained optimizer performs well on the meta-training tasks but fails to generalize to unseen tasks.
2. The experiments only show learning curves for the training stage, leaving unanswered whether the learned optimizer improves generalization.
3. Limited novelty: the $\mu$P parameterization proposed in this submission is very close to a direct application of the original $\mu$P pape

Reviewer 02 · Rating 6 · Confidence 3

Strengths

While I think applying muP to this domain was inevitable, the authors make a good case that the failure to meta-generalize is the major blocker for LOs, and that they make substantial improvements there. I thought the experiments were reasonable, namely the baselines, and the paper was clear throughout, including the maths (though I haven't gone line by line in the proofs).

Weaknesses

I think the findings of ["Scaling Exponents Across Parameterizations and Optimizers"](https://arxiv.org/abs/2407.05872) should have made an appearance somewhere, since they indicate that standard parametrization can also achieve hyperparameter transfer. They also show (in larger problem instances than here) that epsilon should be tuned in Adam. Also, since depth scaling is mentioned (albeit as a bonus), I think the related works and perhaps some experiments would ideally address more heurist

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. This paper introduces the Maximal Update Parametrization to address the meta-generalization problem in learned optimizers. The idea is novel and interesting.
2. The experimental results are thorough and effectively demonstrate the method’s validity.

Weaknesses

1. I recommend adding experiments with convolutional neural networks (CNNs). Although this limitation is mentioned, I believe CNNs are currently mainstream in neural network research, and testing the method on them would further strengthen the validity of the findings.

I am not very familiar with this field, so I will rely on the feedback from other reviewers for the final score.

Code & Models

Repositories

bentherien/mu_learned_optimization
jax

Models

🤗
btherien/mulo
model· 9 dl

Taxonomy

Topics: Advanced Control Systems Optimization · Fuzzy Logic and Control Systems