Exact gradient updates in time independent of output size for the spherical loss family
Pascal Vincent, Alexandre de Brébisson, Xavier Bouthillier

TL;DR
This paper introduces an efficient algorithm for training neural networks with high-dimensional sparse targets. For certain loss functions, it reduces the per-example cost of the output layer from O(Dd), linear in the output size D, to O(d^2), quadratic in the hidden layer size d.
Contribution
The authors develop a method that computes the exact loss and gradients for spherical loss functions in time independent of the output size, without ever computing the large output vector.
Findings
Achieves up to 250x speedup in training time for large output vocabularies.
Provides exact gradient computation without approximation methods.
Applicable to loss functions like squared error and spherical softmax.
Abstract
An important class of problems involves training deep neural networks with sparse prediction targets of very high dimension D. These occur naturally in e.g. neural language models or the learning of word-embeddings, often posed as predicting the probability of next words among a vocabulary of size D (e.g. 200,000). Computing the equally large, but typically non-sparse D-dimensional output vector from a last hidden layer of reasonable dimension d (e.g. 500) incurs a prohibitive O(Dd) computational cost for each example, as does updating the output weight matrix and computing the gradient needed for backpropagation to previous layers. While efficient handling of large sparse network inputs is trivial, the case of large sparse targets is not, and has thus so far been sidestepped with approximate alternatives such as hierarchical softmax or sampling-based approximations during…
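The cost argument in the abstract can be made concrete for the squared error case. The following is a minimal sketch, not the authors' full algorithm: it assumes a linear output layer `W h` and a target with K nonzero entries, and uses the identity ||Wh - y||^2 = h^T Q h - 2 h^T (W^T y) + y^T y with Q = W^T W. All function and variable names here are hypothetical. The sketch builds Q once and does not show how to keep Q current after each weight update, which an actual training loop would need.

```python
import random

def gram(W):
    """Q = W^T W, a d x d matrix; W is D x d stored as a list of rows."""
    d = len(W[0])
    Q = [[0.0] * d for _ in range(d)]
    for row in W:
        for i in range(d):
            for j in range(d):
                Q[i][j] += row[i] * row[j]
    return Q

def sparse_squared_error(Q, W, h, target):
    """Loss ||W h - y||^2 and its gradient w.r.t. h, without forming W h.

    target: dict {index: value} holding the K nonzero entries of y.
    Cost: O(d^2) for the Q terms plus O(K d) for the target terms.
    """
    d = len(h)
    Qh = [sum(Q[i][j] * h[j] for j in range(d)) for i in range(d)]
    # W^T y touches only the K rows of W selected by the nonzero targets.
    Wty = [0.0] * d
    for k, v in target.items():
        for i in range(d):
            Wty[i] += v * W[k][i]
    hQh = sum(h[i] * Qh[i] for i in range(d))
    hWty = sum(h[i] * Wty[i] for i in range(d))
    yy = sum(v * v for v in target.values())
    loss = hQh - 2.0 * hWty + yy
    grad_h = [2.0 * (Qh[i] - Wty[i]) for i in range(d)]
    return loss, grad_h

def naive_squared_error(W, h, target):
    """Reference O(D d) computation that materialises the full output."""
    d = len(h)
    out = [sum(w * x for w, x in zip(row, h)) for row in W]
    resid = [o - target.get(k, 0.0) for k, o in enumerate(out)]
    loss = sum(r * r for r in resid)
    grad_h = [2.0 * sum(W[k][i] * resid[k] for k in range(len(W)))
              for i in range(d)]
    return loss, grad_h

# Small illustrative sizes (assumed); the abstract's examples are D = 200,000, d = 500.
random.seed(0)
D, d = 200, 5
W = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(D)]
h = [random.gauss(0, 1) for _ in range(d)]
target = {3: 1.0, 117: 1.0}

Q = gram(W)
fast_loss, fast_grad = sparse_squared_error(Q, W, h, target)
slow_loss, slow_grad = naive_squared_error(W, h, target)
```

Both routes should agree up to floating-point error. Only the naive one has a cost that grows with D on each example.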
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Stochastic Gradient Optimization Techniques
Methods: Softmax
