Mnemosyne: Learning to Train Transformers with Transformers
Deepali Jain, Krzysztof Marcin Choromanski, Avinava Dubey, Sumeet Singh, Vikas Sindhwani, Tingnan Zhang, Jie Tan

TL;DR
Mnemosyne is a learnable optimizer based on spatio-temporal low-rank implicit attention Transformers. It can train various neural networks, including other Transformers, without task-specific optimizer tuning, and matches the accuracy of carefully tuned hand-designed optimizers.
Contribution
The paper presents Mnemosyne, a new learnable optimizer built on implicit attention Transformers. It trains diverse models without task-specific optimizer tuning, with space complexity comparable to that of hand-designed first-order optimizers.
Findings
Outperforms popular LSTM optimizers.
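Successfully trains Transformers using simple meta-training strategies that require minimal computational resources.
Matches the accuracy of state-of-the-art hand-designed optimizers with carefully tuned hyper-parameters.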
Abstract
In this work, we propose a new class of learnable optimizers, called Mnemosyne. It is based on the novel spatio-temporal low-rank implicit attention Transformers that can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) outperforms popular LSTM optimizers (also with new feature engineering to mitigate catastrophic forgetting of LSTMs), (b) can successfully train Transformers while using simple meta-training strategies that require minimal computational resources, (c) matches accuracy-wise SOTA hand-designed optimizers with carefully tuned hyper-parameters (often producing top performing models). Furthermore, Mnemosyne provides space complexity comparable to that of its hand-designed first-order counterparts, which allows it to scale to training larger sets of parameters. We…
Taxonomy
Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Sigmoid Activation · Adam · Multi-Head Attention · Weight Decay · Residual Connection · Dense Connections · Dropout
