Celo2: Towards Learned Optimization Free Lunch
Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky

TL;DR
This paper introduces a simple, normalized learned optimizer that can be meta-trained efficiently and generalizes well across large-scale and diverse tasks, making learned optimization more practical.
Contribution
It demonstrates that a normalized optimizer architecture, combined with augmented meta-training, enables effective meta-learning with minimal compute and scales to billion-parameter tasks.
Findings
Meta-training requires only 4.5 GPU hours.
Scales effectively to GPT-3 XL 1.3B parameters.
Performs well on out-of-distribution tasks.
Abstract
Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months (10× GPT-3 compute) to meta-train a general-purpose optimizer, but it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL 1.3B) which is six orders of magnitude larger than its meta-training distribution.…
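The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea of a normalized learned update rule, not the Celo2 method itself. It assumes a tiny per-parameter MLP that sees the gradient and a momentum average, each rescaled by its root-mean-square, and whose output is again rescaled to unit RMS before a step size is applied. The class name `TinyLearnedOptimizer`, the two input features, the hidden width, and the random (not meta-trained) weights are all hypothetical choices made for illustration.

```python
import math
import random


def rms(values, eps=1e-8):
    """Root-mean-square of a list of floats (eps avoids division by zero)."""
    return math.sqrt(sum(v * v for v in values) / len(values) + eps)


class TinyLearnedOptimizer:
    """Hypothetical per-parameter MLP update rule; NOT the Celo2 architecture."""

    def __init__(self, hidden=8, beta=0.9, seed=0):
        rng = random.Random(seed)
        # Meta-parameters. A real learned optimizer would meta-train these;
        # here they are random placeholders.
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(2)] for _ in range(hidden)]
        self.w2 = [rng.gauss(0, 0.5) for _ in range(hidden)]
        self.beta = beta
        self.momentum = None

    def _mlp(self, features):
        # One tanh hidden layer, scalar output, applied to a single parameter.
        hidden = [math.tanh(sum(w * f for w, f in zip(row, features)))
                  for row in self.w1]
        return sum(w * h for w, h in zip(self.w2, hidden))

    def step(self, params, grads, lr=0.01):
        if self.momentum is None:
            self.momentum = [0.0] * len(params)
        self.momentum = [self.beta * m + (1 - self.beta) * g
                         for m, g in zip(self.momentum, grads)]
        # Input normalization: features no longer depend on gradient scale.
        g_scale, m_scale = rms(grads), rms(self.momentum)
        raw = [self._mlp((g / g_scale, m / m_scale))
               for g, m in zip(grads, self.momentum)]
        # Output normalization: the update has (roughly) unit RMS before lr.
        out_scale = rms(raw)
        return [p - lr * r / out_scale for p, r in zip(params, raw)]


params = [0.5, -1.0, 2.0, 0.1]
grads = [0.3, -0.7, 1.2, -0.05]
new_params = TinyLearnedOptimizer(seed=0).step(params, grads, lr=0.01)
```

In this sketch the step size is fixed by `lr` regardless of gradient magnitude, which is one plausible reading of why normalization could help an update rule trained on small tasks stay stable on much larger ones; whether Celo2 works this way is not stated in the text above.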
Peer Reviews
Decision · ICLR 2026 Poster
The method proposed is simple yet effective, indicating a great potential for generalization and further improvement. The experimental results provide strong support for the claims. The contribution of this paper is significant towards the practical application of LOs.
I suggest avoiding the term "free lunch" in the paper, as the no-free-lunch theorem indicates that the inductive bias found by Celo2 will likely downgrade performance on some tasks, although these tasks may be unlikely in realistic applications. The paper states, "We would like to emphasize that our primary objective while developing this simple approach for learned optimization was stability." However, no sensitivity analysis is provided. Algorithm 1 is not referred to in the text.
1. The paper's core finding—that a 4.5 GPU-hour "toy" meta-training can produce an LO that scales to 1.3B parameter models—is a "free lunch" that could fundamentally realign research in this field. It moves LOs from a "computationally impossible" (VeLO) to a "highly practical" domain. 2. The paper rightly centers its comparison against VeLO. The results are stark: VeLO, meta-trained with 4000 TPU-months, is unstable and fails on large tasks, while Celo2, trained for 4.5 hours, is stable and sup
The paper makes extraordinary claims (4.5 GPU-hour meta-training, 6-orders-of-magnitude generalization) with a simple recipe. Such "too good to be true" results demand exceptional evidence. However, the paper is missing an appendix and supplementary material, providing no code, implementation details, or full hyperparameter lists beyond what is in the main text. This makes it impossible to verify the claims or assess reproducibility, which is paramount for such a shocking result.
The proposed Celo2 does seem to be simpler, more scalable, and more generalizable based on the experiments. A wide variety of experiments back up the claims, e.g. vision, language, and RL.
The proposed method does not provide a sound theoretical guarantee, making it feel ad hoc. This raises significant concerns about its practical application. In particular, I am very concerned about how a practitioner would use the proposed Celo2 in a real-world application. For example, it is not clear how to replace Adam with a small MLP. It is unclear if this technique was found to work only on specific, "cherry-picked" problems and whether the MLP architecture must be significantly re-tun
Taxonomy
Topics: Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications
