Mixture of Tokens: Continuous MoE through Cross-Example Aggregation
Szymon Antoniak, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Marek Cygan, Sebastian Jaszczur

TL;DR
The paper introduces Mixture of Tokens (MoT), a continuous MoE architecture that scales parameter count like sparse MoE, is compatible with autoregressive decoding, and matches the performance of state-of-the-art MoE models while training faster than dense Transformers.
Contribution
MoT is a continuous MoE design that feeds each expert a mixture of tokens drawn from different examples, which lets the model scale its parameter count like sparse MoE while remaining compatible with autoregressive decoding.
Findings
The best MoT models train 3x faster than dense Transformer models in language pretraining.
MoT matches state-of-the-art MoE performance.
A new technique called transition tuning links MoT and traditional MoE.
Abstract
Mixture of Experts (MoE) models based on Transformer architecture are pushing the boundaries of language and vision tasks. The allure of these models lies in their ability to substantially increase the parameter count without a corresponding increase in FLOPs. Most widely adopted MoE models are discontinuous with respect to their parameters - often referred to as sparse. At the same time, existing continuous MoE designs either lag behind their sparse counterparts or are incompatible with autoregressive decoding. Motivated by the observation that the adaptation of fully continuous methods has been an overarching trend in deep learning, we develop Mixture of Tokens (MoT), a simple, continuous architecture that is capable of scaling the number of parameters similarly to sparse MoE models. Unlike conventional methods, MoT assigns mixtures of tokens from different examples to each expert.…
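The core idea in the abstract, that each expert receives a mixture of tokens taken from different examples, can be illustrated with a small NumPy sketch. This is a hedged illustration, not the authors' implementation: the function name `mixture_of_tokens_layer`, the linear `controller`, the ReLU feed-forward experts, and the choice to reuse the mixing weights when redistributing expert outputs are all assumptions made for the example. The `group` axis is assumed to hold one token from each of several different examples, all at the same sequence position, so no token sees information from later positions in its own sequence.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mixture_of_tokens_layer(tokens, controller, w_in, w_out):
    """Illustrative continuous MoE layer (hypothetical names and shapes).

    tokens:     (group, d_model)          one token per example in the group
    controller: (d_model, n_experts)      assumed linear scoring of tokens
    w_in:       (n_experts, d_model, d_ff)
    w_out:      (n_experts, d_ff, d_model)
    """
    # Score every token for every expert, then normalize over the tokens
    # in the group so each expert gets a convex combination of them.
    logits = tokens @ controller                      # (group, n_experts)
    weights = softmax(logits, axis=0)                 # columns sum to 1

    # One mixed token per expert: a weighted average across examples.
    mixed = weights.T @ tokens                        # (n_experts, d_model)

    # Each expert is a small feed-forward network applied to its mixed token.
    hidden = np.maximum(0.0, np.einsum("ed,edf->ef", mixed, w_in))
    expert_out = np.einsum("ef,efd->ed", hidden, w_out)  # (n_experts, d_model)

    # Assumption: expert outputs are redistributed to the original tokens
    # using the same mixing weights.
    return weights @ expert_out                       # (group, d_model)
```

Because every step is a smooth weighted sum rather than a hard top-k routing decision, the output varies continuously with the controller parameters, which is the property the abstract contrasts with sparse MoE. Each expert processes one mixed token per group regardless of group size, so adding experts adds parameters without a matching increase in per-token compute in this sketch.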
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Adam · Label Smoothing · Position-Wise Feed-Forward Layer
