Celo2: Towards Learned Optimization Free Lunch

Abhinav Moudgil; Boris Knyazev; Eugene Belilovsky

arXiv:2602.19142·cs.LG·February 24, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a simple, normalized learned optimizer that can be meta-trained efficiently and generalizes well across large-scale and diverse tasks, making learned optimization more practical.

Contribution

It demonstrates that a normalized optimizer architecture, combined with augmented meta-training, enables effective meta-learning with minimal compute and scales to billion-parameter tasks.

Findings

01 Meta-training requires only 4.5 GPU hours.

02 Scales stably to a GPT-3 XL pretraining task (1.3B parameters).

03 Performs well on out-of-distribution tasks.

Abstract

Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months ($\sim 10\times$ GPT-3 compute) to meta-train a general-purpose optimizer, but it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL 1.3B), which is six orders of magnitude larger than its meta-training distribution.…
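To make the idea of a "learned update rule" concrete, here is a minimal sketch of how a small MLP can stand in for a hand-designed rule like Adam. It is not the Celo2 architecture, which the abstract does not specify. The feature choice (RMS-normalized gradient and momentum), the unit-RMS normalization of the output, and every name (`init_mlp`, `learned_step`, `step_size`) are assumptions made for illustration, and the MLP weights are random placeholders where meta-training would normally supply learned values.

```python
import math
import random


def rms(xs, eps=1e-8):
    """Root-mean-square of a list, with a small eps for numerical safety."""
    return math.sqrt(sum(x * x for x in xs) / len(xs) + eps)


def init_mlp(n_in, n_hidden, seed=0):
    """Random (untrained) weights; meta-training would learn these instead."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1) / math.sqrt(n_in) for _ in range(n_in)]
          for _ in range(n_hidden)]
    w2 = [rng.gauss(0, 1) / math.sqrt(n_hidden) for _ in range(n_hidden)]
    return w1, w2


def mlp(features, weights):
    """One hidden tanh layer mapping per-parameter features to a scalar."""
    w1, w2 = weights
    hidden = [math.tanh(sum(w * f for w, f in zip(row, features))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


def learned_step(params, grads, momentum, weights, step_size=1e-3, beta=0.9):
    """Apply one step of the hypothetical learned update rule to a flat tensor."""
    # Track an exponential moving average of the gradient, as Adam does.
    momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, grads)]
    # Normalize the input features so the MLP sees scale-free quantities.
    g_scale, m_scale = rms(grads), rms(momentum)
    raw = [mlp([g / g_scale, m / m_scale], weights)
           for g, m in zip(grads, momentum)]
    # Normalize the output so the update has RMS equal to step_size.
    scale = rms(raw)
    new_params = [p - step_size * r / scale for p, r in zip(params, raw)]
    return new_params, momentum


params = [0.5, -1.0, 2.0, 0.1]
grads = [0.3, -0.7, 1.2, -0.05]
weights = init_mlp(n_in=2, n_hidden=4)
params, momentum = learned_step(params, grads, [0.0] * 4, weights)
```

Because both the inputs and the output are normalized in this sketch, the size of each step is set by `step_size` rather than by the raw gradient magnitude, which is one plausible reading of why normalization could help a rule trained on tiny tasks remain stable on much larger ones.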

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The method proposed is simple yet effective, indicating a great potential for generalization and further improvement. The experimental results provide strong support for the claims. The contribution of this paper is significant towards the practical application of LOs.

Weaknesses

I suggest avoiding the term "free lunch" in the paper, as the no-free-lunch theorem indicates that the inductive bias found by Celo2 will likely degrade performance on some tasks, although these tasks may be unlikely in realistic applications. The paper states, "We would like to emphasize that our primary objective while developing this simple approach for learned optimization was stability." However, no sensitivity analysis is provided. Algorithm 1 is not referred to in the text.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper's core finding, that a 4.5 GPU-hour "toy" meta-training can produce an LO that scales to 1.3B-parameter models, is a "free lunch" that could fundamentally realign research in this field. It moves LOs from a "computationally impossible" domain (VeLO) to a "highly practical" one.
2. The paper rightly centers its comparison against VeLO. The results are stark: VeLO, meta-trained with 4000 TPU-months, is unstable and fails on large tasks, while Celo2, trained for 4.5 hours, is stable and sup

Weaknesses

The paper makes extraordinary claims (4.5 GPU-hour meta-training, 6-orders-of-magnitude generalization) with a simple recipe. Such "too good to be true" results demand exceptional evidence. However, the paper is missing an appendix and supplementary material, providing no code, implementation details, or full hyperparameter lists beyond what is in the main text. This makes it impossible to verify the claims or assess reproducibility, which is paramount for such a shocking result.

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The proposed Celo2 does seem to be simpler, more scalable, and more generalizable based on the experiments. There is a wide variety of experiments to back up the claims, e.g. vision, language, and RL.

Weaknesses

The proposed method does not provide a sound theoretical guarantee, making it feel ad hoc. This raises significant concerns about its practical application. In particular, I would be very concerned about how a practitioner would use the proposed Celo2 in a real-world application. For example, it is not clear how to replace Adam with a small MLP. It is unclear if this technique was found to work only on specific, "cherry-picked" problems and whether the MLP architecture must be significantly re-tun

Taxonomy

Topics: Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications