An Isometric Stochastic Optimizer
Jacob Jackson

TL;DR
This paper introduces Iso, an optimizer motivated by an explanation of Adam's success. Iso makes the norm of each parameter's update invariant to linear transformations of that parameter's inputs and outputs, and its variant IsoAdam trains a small Transformer faster than Adam.
Contribution
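The paper proposes Iso, an optimizer whose update norm for each parameter is invariant to linear transformations of the parameter's inputs and outputs, and IsoAdam, a variant that allows optimal hyperparameters to be transferred from Adam. IsoAdam is shown to obtain a speedup over Adam when training a small Transformer.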
Findings
IsoAdam obtains a speedup over Adam when training a small Transformer.
The norm of an Iso update for a parameter is invariant to any linear transformation applied to that parameter's inputs and outputs.
Optimal hyperparameters can be transferred from Adam to IsoAdam.
Abstract
The Adam optimizer is the standard choice in deep learning applications. I propose a simple explanation of Adam's success: it makes each parameter's step size independent of the norms of the other parameters. Based on this principle I derive Iso, a new optimizer which makes the norm of a parameter's update invariant to the application of any linear transformation to its inputs and outputs. I develop a variant of Iso called IsoAdam that allows optimal hyperparameters to be transferred from Adam, and demonstrate that IsoAdam obtains a speedup over Adam when training a small Transformer.
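The principle stated for Adam can be illustrated with a small sketch. It is not the Iso or IsoAdam algorithm, which this summary does not specify. It assumes that the norms of other parameters reach a given parameter only as a constant factor on its gradient (as happens, for example, when a downstream weight is rescaled), and it sets Adam's epsilon to zero so the invariance is exact. The function names and the gradient values are hypothetical.

```python
import math

def adam_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=0.0):
    """Standard Adam updates for one scalar parameter, given its gradient history.

    eps defaults to 0.0 here (an assumption for this sketch) so that the
    update is exactly invariant to rescaling the gradients.
    """
    m = 0.0  # first-moment estimate
    v = 0.0  # second-moment estimate
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps

def sgd_steps(grads, lr=1e-3):
    """Plain SGD updates for comparison: the step is proportional to the gradient."""
    return [-lr * g for g in grads]

# Hypothetical gradient history for one parameter.
grads = [0.5, -0.2, 0.8, 0.1]

# Stand-in for "the other parameters got larger": every gradient is
# multiplied by the same constant.
scale = 100.0
scaled = [scale * g for g in grads]

adam_base = adam_steps(grads)
adam_scaled = adam_steps(scaled)
sgd_base = sgd_steps(grads)
sgd_scaled = sgd_steps(scaled)

# Largest difference between Adam's steps with and without the rescaling.
adam_gap = max(abs(a - b) for a, b in zip(adam_base, adam_scaled))
print("max Adam step difference:", adam_gap)
print("SGD step ratio:", sgd_scaled[0] / sgd_base[0])
```

In this sketch Adam's steps agree up to floating-point rounding whether or not the gradients are rescaled, while the SGD steps grow by the same factor as the gradients. With a nonzero epsilon the agreement is only approximate.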
Taxonomy
Topics: Neural Networks and Applications · Image and Signal Denoising Methods · Model Reduction and Neural Networks
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Layer Normalization · Absolute Position Encodings · Linear Layer · Softmax · Dense Connections · Dropout · Residual Connection
