Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport

Lingkai Kong; Yuqing Wang; Molei Tao

arXiv:2205.14173·cs.LG·March 6, 2023

TL;DR

This paper introduces a novel momentum-based optimizer for the Stiefel manifold that preserves orthogonality constraints efficiently, improving performance in applications like orthogonal attention in Vision Transformers and high-dimensional optimal transport.

Contribution

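It proposes a gradient-based optimizer for the Stiefel manifold derived, for the first time, from an interplay between designed continuous and discrete dynamics. Momentum is built in intrinsically, the manifold structure is preserved exactly, and the method generalizes to adaptive learning rates.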

Findings

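01 Placing orthogonal constraints on the attention heads of a trained-from-scratch Vision Transformer markedly improves its performance when this optimizer is used.

02 The optimizer enhances the effectiveness of Projection Robust Wasserstein Distance.

03 The method preserves the manifold structure exactly without extra operations to keep momentum in the changing (co)tangent space, which keeps its computational cost low.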

Abstract

The problem of optimization on Stiefel manifold, i.e., minimizing functions of (not necessarily square) matrices that satisfy orthogonality constraints, has been extensively studied. Yet, a new approach is proposed based on, for the first time, an interplay between thoughtfully designed continuous and discrete dynamics. It leads to a gradient-based optimizer with intrinsically added momentum. This method exactly preserves the manifold structure but does not require additional operation to keep momentum in the changing (co)tangent space, and thus has low computational cost and pleasant accuracy. Its generalization to adaptive learning rates is also demonstrated. Notable performances are observed in practical tasks. For instance, we found that placing orthogonal constraints on attention heads of trained-from-scratch Vision Transformer [Dosovitskiy et al. 2022] could markedly improve its…
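To make the setting concrete: the Stiefel manifold consists of n-by-m matrices X with orthonormal columns, that is, X^T X = I. The sketch below is not the optimizer proposed in the paper. It is a generic baseline (heavy-ball momentum with tangent-space projection and a QR retraction) written in NumPy, and every function name in it is an illustrative assumption. It shows the constraint check and the re-projection of momentum onto the moving tangent space, which is the kind of extra operation the abstract says the proposed method does not require.

```python
import numpy as np

def stiefel_residual(X):
    """Frobenius norm of X^T X - I; zero exactly when X has orthonormal columns."""
    m = X.shape[1]
    return np.linalg.norm(X.T @ X - np.eye(m))

def tangent_project(X, G):
    """Project G onto the tangent space of the Stiefel manifold at X
    (embedded Euclidean metric): G - X * sym(X^T G)."""
    sym = (X.T @ G + G.T @ X) / 2.0
    return G - X @ sym

def qr_retract(Y):
    """Map a full-column-rank matrix to one with orthonormal columns via thin QR."""
    Q, R = np.linalg.qr(Y)
    # Flip column signs so that R has a positive diagonal (makes Q unique).
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

def baseline_momentum_descent(grad_fn, X0, lr=0.01, beta=0.9, steps=500):
    """Generic heavy-ball baseline on the Stiefel manifold (NOT the paper's method)."""
    X = X0.copy()
    V = np.zeros_like(X)
    for _ in range(steps):
        G = tangent_project(X, grad_fn(X))
        # The old momentum lives in the previous tangent space, so it is
        # re-projected at the new point before being reused.
        V = beta * tangent_project(X, V) + G
        X = qr_retract(X - lr * V)
    return X

# Toy problem (an assumption for illustration): minimize -trace(X^T A X),
# whose minimizers span the top-m eigenvectors of the symmetric matrix A.
A = np.diag(np.arange(1.0, 7.0))

def objective(X):
    return -np.trace(X.T @ A @ X)

def grad_fn(X):
    return -2.0 * A @ X

rng = np.random.default_rng(0)
X0 = qr_retract(rng.standard_normal((6, 2)))
X_final = baseline_momentum_descent(grad_fn, X0)
```

After the run, `stiefel_residual(X_final)` stays at round-off level because each step ends with a retraction. The cost of this baseline is the per-step QR factorization plus the momentum re-projection.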

Code & Models

Repositories

konglk1203/variationalstiefeloptimizer (PyTorch, official)

Videos

Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Medical Image Segmentation Techniques · Brain Tumor Detection and Classification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Dropout · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Residual Connection