Mnemosyne: Learning to Train Transformers with Transformers

Deepali Jain; Krzysztof Marcin Choromanski; Avinava Dubey; Sumeet Singh; Vikas Sindhwani; Tingnan Zhang; Jie Tan

arXiv:2302.01128·cs.LG·June 21, 2023

TL;DR

Mnemosyne is a learnable optimizer built on spatio-temporal low-rank implicit attention Transformers. It trains entire neural networks, including other Transformers, without task-specific optimizer tuning, and matches the accuracy of carefully tuned hand-designed optimizers.

Contribution

The paper presents Mnemosyne, a new class of learnable optimizers built on implicit attention Transformers. It trains diverse models without task-specific optimizer tuning, with space complexity comparable to that of hand-designed first-order optimizers.

Findings

01

Outperforms popular LSTM optimizers.

02

Successfully trains Transformers using simple meta-training strategies that require minimal computational resources.

03

Matches the accuracy of state-of-the-art hand-designed optimizers with carefully tuned hyper-parameters.

Abstract

In this work, we propose a new class of learnable optimizers, called Mnemosyne. It is based on the novel spatio-temporal low-rank implicit attention Transformers that can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) outperforms popular LSTM optimizers (also with new feature engineering to mitigate catastrophic forgetting of LSTMs), (b) can successfully train Transformers while using simple meta-training strategies that require minimal computational resources, (c) matches accuracy-wise SOTA hand-designed optimizers with carefully tuned hyper-parameters (often producing top performing models). Furthermore, Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters. We…

Videos

Mnemosyne: Learning to Train Transformers with Transformers · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Sigmoid Activation · Adam · Multi-Head Attention · Weight Decay · Residual Connection · Dense Connections · Dropout