MultiMax: Sparse and Multi-Modal Attention Learning
Yuxuan Zhou, Mario Fritz, Margret Keuper

TL;DR
MultiMax introduces a novel piece-wise differentiable function that enhances sparsity and multi-modality in attention mechanisms, improving interpretability and performance across various machine learning tasks.
Contribution
The paper proposes MultiMax, a new function that balances sparsity and multi-modality, addressing the trade-off that limits SoftMax and its variants in neural attention models.
Findings
MultiMax effectively suppresses irrelevant entries in distributions.
It preserves multi-modality better than SoftMax variants.
It demonstrates improvements in image classification, language modeling, and machine translation.
Abstract
SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It maps an input vector onto a probability simplex and reweights the input by concentrating the probability mass at large entries. Yet, as a smooth approximation to the Argmax function, a significant amount of probability mass is distributed to other, residual entries, leading to poor interpretability and noise. Although sparsity can be achieved by a family of SoftMax variants, they often require an alternative loss function and do not preserve multi-modality. We show that this trade-off between multi-modality and sparsity limits the expressivity of SoftMax as well as its variants. We provide a solution to this tension between objectives by proposing a piece-wise differentiable function, termed MultiMax, which adaptively modulates the output distribution according to input entry range. Through comprehensive…
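To make the trade-off described in the abstract concrete, the following is a minimal sketch of plain SoftMax with a temperature parameter. It is not the MultiMax function from the paper; the scores, the temperature values, and the two diagnostic quantities are illustrative assumptions chosen only to show residual probability mass and the loss of a second mode.

```python
import math


def softmax(x, temperature=1.0):
    """Map a list of scores onto the probability simplex."""
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores (an assumption for illustration): two relevant
# entries forming two modes, followed by eight small, irrelevant entries.
scores = [3.0, 2.5] + [0.0] * 8

for t in (1.0, 0.1):
    p = softmax(scores, temperature=t)
    residual = sum(p[2:])      # probability mass on the irrelevant entries
    second_mode = p[1] / p[0]  # how much of the second peak survives
    print(f"T={t}: residual mass={residual:.3f}, second/first={second_mode:.3f}")
```

At temperature 1.0, roughly a fifth of the probability mass lands on the eight small entries. Lowering the temperature to 0.1 removes almost all of that residual mass, but the second peak also drops to under 1% of the first, so sparsity is gained at the cost of multi-modality.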
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Softmax
