From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification
André F. T. Martins, Ramón Fernandez Astudillo

TL;DR
This paper introduces sparsemax, an activation function that outputs sparse probability distributions, together with a matching loss function. It reports promising results in multi-label classification and, in attention-based networks for natural language inference, performance similar to softmax with a more selective attention focus.
Contribution
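The paper presents sparsemax, an activation function that yields sparse probabilities, shows how its Jacobian can be computed efficiently for backpropagation, and proposes a corresponding smooth and convex loss function. Both are evaluated on multi-label classification and attention-based natural language inference.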
Findings
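In attention-based natural language inference, sparsemax performs similarly to softmax while producing a more selective, compact attention focus.
The new loss function is smooth and convex, the sparsemax analogue of the logistic loss, and is connected to the Huber classification loss.
Empirical results on multi-label classification problems are promising.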
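The Huber connection can be checked numerically. The sketch below is a minimal illustration, not the authors' code. It assumes the sparsemax loss has the form L(z; k) = -z_k + (1/2) * sum over j in S(z) of (z_j^2 - tau(z)^2) + 1/2, where tau(z) is the sparsemax threshold and S(z) is the set of labels with nonzero probability. In the two-class case it compares that loss with a Huber-style piecewise function of the margin t = z_1 - z_2. All function names are hypothetical.

```python
def sparsemax_threshold(z):
    """Return tau such that sum(max(z_i - tau, 0)) == 1 (sort-based search)."""
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(sorted(z, reverse=True), start=1):
        cumsum += z_k
        # The k-th largest score is in the support while this inequality holds.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    return tau


def sparsemax_loss(z, k):
    """Assumed form of the sparsemax loss for score list z and gold label index k."""
    tau = sparsemax_threshold(z)
    support_term = sum(z_j ** 2 - tau ** 2 for z_j in z if z_j > tau)
    return -z[k] + 0.5 * support_term + 0.5


def huber_style_binary(t):
    """Piecewise form obtained by plugging z = (t, 0), k = 0 into the loss above."""
    if t >= 1.0:
        return 0.0              # confidently correct: zero loss
    if t <= -1.0:
        return -t               # confidently wrong: linear penalty
    return (t - 1.0) ** 2 / 4   # in between: quadratic penalty


for t in [-2.0, -0.3, 0.5, 1.0]:
    print(t, sparsemax_loss([t, 0.0], 0), huber_style_binary(t))
```

The two printed columns should agree for every margin. The zero, quadratic, and linear regions are what give the binary loss its Huber-like shape.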
Abstract
We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, we show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. Then, we propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. We reveal an unexpected connection between this new loss and the Huber classification loss. We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve a similar performance as the traditional softmax, but with a selective, more compact, attention focus.
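As a rough illustration of the activation the abstract describes, the following sketch computes sparsemax as the Euclidean projection of a score vector onto the probability simplex, using a sort-based threshold search. It also includes a Jacobian-vector product, assuming the Jacobian has the form Diag(s) - s s^T / |S(z)|, where s is the 0/1 indicator of the support S(z). The function names are illustrative and this is not the authors' implementation.

```python
def sparsemax(z):
    """Project the score list z onto the probability simplex."""
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(sorted(z, reverse=True), start=1):
        cumsum += z_k
        # The k-th largest score stays in the support while this holds.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    # Scores at or below the threshold tau are truncated to exactly zero.
    return [max(z_i - tau, 0.0) for z_i in z]


def sparsemax_jvp(z, v):
    """Multiply the (assumed) sparsemax Jacobian at z by the vector v."""
    support = [p_i > 0.0 for p_i in sparsemax(z)]
    # Mean of v over the support; the support is never empty.
    v_hat = sum(v_i for v_i, s in zip(v, support) if s) / sum(support)
    # Entries outside the support receive zero gradient.
    return [v_i - v_hat if s else 0.0 for v_i, s in zip(v, support)]


print(sparsemax([1.0, 0.5, -1.0]))                       # [0.75, 0.25, 0.0]
print(sparsemax_jvp([1.0, 0.5, -1.0], [1.0, 0.0, 5.0]))  # [0.5, -0.5, 0.0]
```

Unlike softmax, which assigns strictly positive probability to every entry, the third score here receives exactly zero. The Jacobian-vector product only needs the support and one mean, so it costs time linear in the number of scores once the forward pass is done.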
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Machine Learning and Algorithms
Methods: Sparsemax
