Learning Factored Representations in a Deep Mixture of Experts
David Eigen, Marc'Aurelio Ranzato, Ilya Sutskever

TL;DR
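This paper introduces the Deep Mixture of Experts, a stacked mixture-of-experts model that exponentially increases the number of effective experts while keeping model size modest. On randomly translated MNIST images the model learns location-dependent experts at the first layer and class-specific experts at a deeper layer, and on speech data it makes use of all expert combinations.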
Contribution
The work extends Mixture of Experts to a deep, multi-layer model, enabling exponential growth in expert combinations with efficient computation and training.
Findings
Learned location-dependent experts for images
Developed class-specific experts at deeper layers
Effectively used all expert combinations in speech data
Abstract
Mixtures of Experts combine the outputs of several "expert" networks, each of which specializes in a different part of the input space. This is achieved by training a "gating" network that maps each input to a distribution over the experts. Such models show promise for building larger networks that are still cheap to compute at test time, and more parallelizable at training time. In this work, we extend the Mixture of Experts to a stacked model, the Deep Mixture of Experts, with multiple sets of gating and experts. This exponentially increases the number of effective experts by associating each input with a combination of experts at each layer, yet maintains a modest model size. On a randomly translated version of the MNIST dataset, we find that the Deep Mixture of Experts automatically learns to develop location-dependent ("where") experts at the first layer, and class-specific…
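Illustrative sketch
The stacking described in the abstract can be illustrated with a small forward pass. This is a minimal sketch rather than the authors' implementation: it assumes linear experts, a softmax over a linear map as the gating network, random weights, and no nonlinearity between layers. All function names, layer sizes, and the seed are hypothetical choices made for the example.

```python
import math
import random


def softmax(scores):
    """Turn raw gate scores into a distribution over experts."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def make_layer(rng, n_experts, in_dim, out_dim):
    """One mixture layer: a gating matrix plus one weight matrix per expert.

    Linear experts and a linear-softmax gate are assumptions of this sketch.
    """
    return {
        "gate": random_matrix(rng, n_experts, in_dim),
        "experts": [random_matrix(rng, out_dim, in_dim) for _ in range(n_experts)],
    }


def moe_layer(layer, x):
    """Mix the expert outputs using the gate's distribution for this input."""
    gates = softmax(linear(layer["gate"], x))
    outputs = [linear(w, x) for w in layer["experts"]]
    out_dim = len(outputs[0])
    mixed = [sum(g * o[k] for g, o in zip(gates, outputs)) for k in range(out_dim)]
    return mixed, gates


def deep_moe(layers, x):
    """Stack mixture layers: each layer gates and mixes the previous output."""
    all_gates = []
    for layer in layers:
        x, gates = moe_layer(layer, x)
        all_gates.append(gates)
    return x, all_gates


rng = random.Random(0)  # fixed seed so the sketch is repeatable
layers = [make_layer(rng, 4, 8, 6), make_layer(rng, 4, 6, 3)]
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]

output, gates = deep_moe(layers, x)

# Each input is associated with one gate distribution per layer, so the number
# of expert combinations is the product of the per-layer expert counts.
n_combinations = math.prod(len(g) for g in gates)
```

With two layers of four experts each, the sketch stores eight expert matrices yet an input can be routed through 4 × 4 = 16 expert combinations. This is the sense in which the number of effective experts grows multiplicatively with depth while parameter count grows only additively.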
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
