Narrowing the Focus: Learned Optimizers for Pretrained Models

Gus Kristiansen; Mark Sandler; Andrey Zhmoginov; Nolan Miller; Anirudh Goyal; Jihwan Lee; Max Vladymyrov

arXiv:2408.09310 · cs.LG · October 8, 2024

TL;DR

This paper introduces a specialized learned optimizer that combines multiple update directions to adapt to specific training environments, significantly improving performance on image classification tasks over traditional and general learned optimizers.

Contribution

It proposes a novel layer-specific linear combination approach for learned optimizers, tailored to specific models and datasets, enhancing effectiveness and robustness.
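The idea of a layer-specific linear combination can be illustrated with a small sketch. This is not the paper's implementation: the choice of three base directions (plain gradient, momentum, and an Adam-like normalized direction), the function names `base_directions` and `combined_step`, and the fixed coefficient values are all assumptions made for illustration. In the paper the combination is learned, whereas here the per-layer coefficients are simply given by hand.

```python
import math

def base_directions(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return candidate update directions for one layer from three base rules.

    `state` holds running first and second moments for the layer and is
    updated in place. Bias correction is omitted for brevity.
    """
    state["m"] = [beta1 * m + (1 - beta1) * g for m, g in zip(state["m"], grad)]
    state["v"] = [beta2 * v + (1 - beta2) * g * g for v, g in zip(state["v"], grad)]
    sgd = list(grad)                      # plain gradient direction
    momentum = list(state["m"])           # exponential moving average of gradients
    adam_like = [m / (math.sqrt(v) + eps) for m, v in zip(state["m"], state["v"])]
    return [sgd, momentum, adam_like]

def combined_step(params, grads, states, coeffs):
    """Apply one step in which each layer mixes base directions with its own weights."""
    new_params = {}
    for name, weights in params.items():
        directions = base_directions(grads[name], states[name])
        mix = coeffs[name]  # hypothetical: one coefficient per base direction, per layer
        update = [
            sum(c * d[i] for c, d in zip(mix, directions))
            for i in range(len(weights))
        ]
        new_params[name] = [w - u for w, u in zip(weights, update)]
    return new_params

# Toy example with two "layers" stored as flat lists of weights.
params = {"conv1": [1.0, -2.0], "fc": [0.5]}
grads = {"conv1": [0.1, -0.2], "fc": [0.4]}
states = {k: {"m": [0.0] * len(v), "v": [0.0] * len(v)} for k, v in params.items()}
# Hand-picked coefficients: conv1 uses only the gradient, fc only the Adam-like direction.
coeffs = {"conv1": [0.1, 0.0, 0.0], "fc": [0.0, 0.0, 0.01]}
new_params = combined_step(params, grads, states, coeffs)
```

Because each layer has its own coefficient vector, one layer can behave like SGD while another behaves like an adaptive method, which is the kind of flexibility the contribution describes.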

Findings

01. Outperforms traditional optimizers like Adam on image classification.
02. Shows strong generalization across different datasets and training durations.
03. Demonstrates robustness to model initialization.

Abstract

In modern deep learning, models are trained by applying gradient updates through an optimizer, which transforms the updates based on various statistics. Optimizers are often hand-designed, and tuning their hyperparameters is a large part of the training process. Learned optimizers have shown some initial promise, but have generally been unsuccessful as a general optimization mechanism applicable to every problem. In this work we explore a different direction: instead of learning general optimizers, we specialize them to a specific training environment. We propose a novel optimizer technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers, effectively adapting its strategy to the specific model and dataset. When evaluated on image classification tasks, this specialized optimizer significantly outperforms both traditional…

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The presentation is clear and easy to understand; the experiments are comprehensive and well explained.
2. The proposed method is novel and useful: it eliminates the need for learning rate schedule tuning.
3. The proposed method demonstrates faster convergence and improved performance.

Weaknesses

1. L3RS does not generalize across architectures and must be trained independently for each model architecture, which limits its applicability. The ImageNet <-> PLACES experiments suggest that a trained L3RS can generalize to other data distributions, and including more benchmarks would help demonstrate that.
2. Experiments are only conducted on ResNet34, while including evaluation on other architectures and more diverse benchmarks (e.g. [VeLO](https://arxiv.org/abs/2211.09760)) would make the result

Reviewer 02 · Rating 3 · Confidence 3

Strengths

1) The paper studies the important challenge of automating hyperparameter tuning for optimizers, which is a critical aspect of training neural networks efficiently.
2) By integrating the update directions from multiple optimizers and learning the optimal combination, the proposed method introduces a promising and potentially more adaptable approach to optimizer design.
3) L3RS shows enhanced performance in the early stages of fine-tuning compared to baseline methods.

Weaknesses

1) My main concern is the applicability of this approach. According to Figure 3, the proposed learned optimizer only outperforms Adam in the early stages with smaller training steps. For larger training steps, Adam achieves comparable performance to the learned optimizer. Given that the learned optimizer requires extra time and memory to learn additional parameters, using it may not be necessary. Furthermore, the total convergence time, including the meta-learning process, may be slower than Ada

Reviewer 03 · Rating 5 · Confidence 4

Strengths

The experiments on task distribution are comprehensive. The section on preliminaries is comprehensive. The ablation studies show the effectiveness of each design choice.

Weaknesses

- The abstract is poorly written with wording that replaces necessary definitions. What is meant by “transforms the updates” (missing context), “various statistics” (can be replaced with an actual example) or “hand-designed” (didn’t understand what this could be in relation to)? In general, in terms of writing, the paper seems rushed and incomplete with inconsistencies in writing quality.
- L154: There is a mention of "a lot of parameters". A casual writing style isn't appropriate. Either the n

Taxonomy

Topics: Reservoir Engineering and Simulation Methods

Methods: Sparse Evolutionary Training · Balanced Selection · Adam