Mixture of Raytraced Experts
Andrea Perin, Giacomo Lagomarsini, Claudio Gallicchio, Giuseppe Nuti

TL;DR
This paper introduces Mixture of Raytraced Experts, a stacked MoE architecture that dynamically selects a sequence of experts for each sample. In preliminary experiments it reaches comparable or higher accuracy in 10% to 40% fewer training epochs, without load-balancing mechanisms.
Contribution
It presents a novel MoE architecture that dynamically selects sequences of experts, reducing training epochs and increasing model flexibility without requiring load-balancing mechanisms.
Findings
Training epochs reduced by 10% to 40% in preliminary experiments
Accuracy comparable to or higher than before
Computational graphs of variable width and depth
Abstract
We introduce a Mixture of Raytraced Experts, a stacked Mixture of Experts (MoE) architecture which can dynamically select sequences of experts, producing computational graphs of variable width and depth. Existing MoE architectures generally require a fixed amount of computation for a given sample. Our approach, in contrast, yields predictions with increasing accuracy as the computation cycles through the experts' sequence. We train our model by iteratively sampling from a set of candidate experts, unfolding the sequence akin to how Recurrent Neural Networks are trained. Our method does not require load-balancing mechanisms, and preliminary experiments show a reduction in training epochs of 10% to 40% with a comparable/higher accuracy. These results point to new research directions in the field of MoEs, allowing the design of potentially faster and more expressive models. The code is…
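The idea of iteratively sampling from a set of candidate experts can be illustrated with a toy sketch. This is not the authors' implementation: the layered grid of experts, the softmax over routing scores, the rule that activating an expert makes the next layer's experts eligible, and all names (`sample_expert_sequence`, `gate_scores`, `max_steps`) are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_expert_sequence(gate_scores, num_layers, experts_per_layer, max_steps, rng):
    """Sample a sequence of experts from a growing candidate set.

    gate_scores: dict mapping (layer, index) -> routing score (hypothetical).
    Returns the activated experts as (layer, index) pairs, in activation order.
    """
    # Assumption: every expert in the first layer starts as a candidate.
    candidates = [(0, i) for i in range(experts_per_layer)]
    activated = []
    for _ in range(max_steps):
        if not candidates:
            break
        # Sample one candidate in proportion to its softmaxed routing score.
        probs = softmax([gate_scores[c] for c in candidates])
        chosen = rng.choices(candidates, weights=probs, k=1)[0]
        candidates.remove(chosen)
        activated.append(chosen)
        # Assumption: activating an expert makes the next layer eligible,
        # so the sampled graph can grow in width (same layer) or depth.
        layer = chosen[0]
        if layer + 1 < num_layers:
            for j in range(experts_per_layer):
                nxt = (layer + 1, j)
                if nxt not in activated and nxt not in candidates:
                    candidates.append(nxt)
    return activated


rng = random.Random(0)
scores = {(l, i): rng.uniform(-1.0, 1.0) for l in range(3) for i in range(4)}
sequence = sample_expert_sequence(scores, num_layers=3, experts_per_layer=4,
                                  max_steps=6, rng=rng)
print(sequence)
```

Raising `max_steps` spends more computation on a sample and lowering it spends less, which loosely mirrors the abstract's point that the amount of computation is not fixed per sample. In this sketch a prediction could be read out after any step, but how the actual model produces and refines its predictions is not specified here.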
