SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization
Kwangryeol Park, Seulki Lee

TL;DR
SMMF is a memory-efficient optimizer that reduces the memory usage of adaptive optimizers such as Adam by up to 96% while maintaining comparable performance on CNN and Transformer models.
Contribution
It presents a novel square-matricization technique for flexible, low-rank momentum tensor factorization applicable to various deep learning architectures.
Findings
Reduces memory usage by up to 96% compared to existing optimizers.
Achieves similar model performance on CNN and Transformer tasks.
Provides theoretical convergence guarantees similar to AdamNC.
Abstract
We propose SMMF (Square-Matricized Momentum Factorization), a memory-efficient optimizer that reduces the memory requirement of widely used adaptive learning-rate optimizers, such as Adam, by up to 96%. SMMF enables flexible and efficient factorization of first and second momentum tensors of arbitrary rank (shape) during optimization, based on the proposed square-matricization and one-time single matrix factorization. As a result, it is applicable to momentum tensors of any rank (shape), i.e., biases, matrices, and rank-d tensors, which are prevalent in deep model architectures such as CNNs (high rank) and Transformers (low rank), in contrast to existing memory-efficient optimizers that apply only to a particular (rank-2) momentum tensor, e.g., that of a linear layer. We conduct a regret bound analysis of SMMF, which shows that it converges similarly to…
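The two ideas named in the abstract, square-matricization and a single matrix factorization, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that square-matricization means reshaping a tensor with N elements into the n x m matrix (n * m = N) whose sides are as close to each other as possible, and it swaps in an Adafactor-style rank-1 factorization (row sums and column sums) for a non-negative matrix such as a second-moment estimate. The function names `square_matricize_shape`, `rank1_factorize`, and `rank1_reconstruct` are hypothetical, and the sketch does not attempt to cover how signed first-moment values are handled.

```python
import math
import numpy as np


def square_matricize_shape(numel):
    """Return (n, m) with n * m == numel, n <= m, and n as close to m as possible.

    Assumption: this is the shape rule behind "square-matricization".
    """
    if numel < 1:
        raise ValueError("numel must be a positive integer")
    n = math.isqrt(numel)
    # Walk down from floor(sqrt(numel)) to the nearest divisor of numel.
    while numel % n != 0:
        n -= 1
    return n, numel // n


def rank1_factorize(mat):
    """Adafactor-style rank-1 factors of a non-negative matrix: row and column sums."""
    return mat.sum(axis=1), mat.sum(axis=0)


def rank1_reconstruct(row, col):
    """Rebuild the rank-1 approximation outer(row, col) / sum(row)."""
    total = row.sum()
    if total == 0:
        return np.zeros((row.size, col.size), dtype=row.dtype)
    return np.outer(row, col) / total


# Example: a rank-4 convolution weight of shape (64, 32, 3, 3) has 18432 elements.
conv_shape = (64, 32, 3, 3)
numel = int(np.prod(conv_shape))
n, m = square_matricize_shape(numel)  # (128, 144)

# A stand-in for a non-negative second-moment tensor (squared gradients).
rng = np.random.default_rng(0)
second_moment = rng.standard_normal(conv_shape) ** 2

# Reshape to the near-square matrix, then keep only the two factor vectors.
row, col = rank1_factorize(second_moment.reshape(n, m))
approx = rank1_reconstruct(row, col).reshape(conv_shape)

stored = row.size + col.size  # n + m values instead of n * m
```

For this tensor the two factor vectors hold 128 + 144 = 272 values instead of 18432. A near-square shape minimizes n + m for a fixed n * m, which is why the reshape matters for tensors that are not already matrices. The reconstruction is only an approximation in general; it is exact when the matrix is itself rank-1.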
Taxonomy
Topics: Video Analysis and Summarization · Advanced Data Compression Techniques
Methods: Attention Is All You Need · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing · Layer Normalization
