Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport
Lingkai Kong, Yuqing Wang, Molei Tao

TL;DR
This paper introduces a novel momentum-based optimizer for the Stiefel manifold that preserves orthogonality constraints efficiently, improving performance in applications like orthogonal attention in Vision Transformers and high-dimensional optimal transport.
Contribution
It derives a Stiefel-manifold optimizer from an interplay between continuous and discrete dynamics, an approach the authors describe as new. The resulting gradient-based method has intrinsic momentum, generalizes to adaptive learning rates, and performs well on practical tasks.
Findings
Orthogonal constraints on attention heads improve Vision Transformer performance.
The optimizer enhances the effectiveness of Projection Robust Wasserstein Distance.
The method preserves the manifold structure exactly at low computational cost, with no extra operation to keep the momentum in the changing (co)tangent space.
Abstract
The problem of optimization on Stiefel manifold, i.e., minimizing functions of (not necessarily square) matrices that satisfy orthogonality constraints, has been extensively studied. Yet, a new approach is proposed based on, for the first time, an interplay between thoughtfully designed continuous and discrete dynamics. It leads to a gradient-based optimizer with intrinsically added momentum. This method exactly preserves the manifold structure but does not require additional operation to keep momentum in the changing (co)tangent space, and thus has low computational cost and pleasant accuracy. Its generalization to adaptive learning rates is also demonstrated. Notable performances are observed in practical tasks. For instance, we found that placing orthogonal constraints on attention heads of trained-from-scratch Vision Transformer [Dosovitskiy et al. 2022] could markedly improve its…
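To make the constraint in the abstract concrete: a point on the Stiefel manifold is an n-by-m matrix X with X^T X = I. The sketch below is not the paper's optimizer. It is a generic textbook-style baseline (heavy-ball momentum, tangent-space projection, QR retraction), written with NumPy, and every function name and the toy objective f(X) = -trace(X^T A X) are assumptions made for illustration. It shows the extra step that baselines of this kind typically need, namely re-projecting the momentum after each update, which is the kind of additional operation the abstract says the proposed method avoids.

```python
import numpy as np


def tangent_project(X, G):
    """Project G onto the tangent space of the Stiefel manifold at X.

    Assumes X has orthonormal columns (X.T @ X = I).
    """
    XtG = X.T @ G
    sym = 0.5 * (XtG + XtG.T)
    return G - X @ sym


def qr_retract(Y):
    """Map a full-rank n-by-m matrix back to orthonormal columns via QR."""
    Q, R = np.linalg.qr(Y)  # reduced QR: Q is n-by-m, R is m-by-m
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs  # flip column signs so diag(R) would be positive


def baseline_momentum_step(X, M, grad, lr=0.01, beta=0.9):
    """One step of a generic baseline (NOT the paper's method)."""
    G = tangent_project(X, grad)       # Riemannian gradient at X
    M = beta * M + G                   # heavy-ball momentum
    X_new = qr_retract(X - lr * M)     # step, then return to the manifold
    M_new = tangent_project(X_new, M)  # extra step: move momentum to new tangent space
    return X_new, M_new


def run_demo(n=8, m=3, steps=300, seed=0):
    """Run the baseline on a toy objective f(X) = -trace(X.T @ A @ X)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    A = A + A.T  # symmetric toy matrix (hypothetical test problem)
    X = qr_retract(rng.standard_normal((n, m)))
    M = np.zeros_like(X)
    for _ in range(steps):
        grad = -2.0 * A @ X  # Euclidean gradient of the toy objective
        X, M = baseline_momentum_step(X, M, grad)
    return X, A
```

Calling `run_demo()` returns an iterate whose columns stay orthonormal up to floating-point error, because the QR retraction restores the constraint at every step. The price is one QR factorization and one momentum re-projection per iteration.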
Taxonomy
Topics: Advanced Neural Network Applications · Medical Image Segmentation Techniques · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Dropout · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Residual Connection
