MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Maciej Pióro; Kamil Ciebiera; Krystian Król; Jan Ludziejewski; Michał Krutul; Jakub Krajewski; Szymon Antoniak; Piotr Miłoś; Marek Cygan; Sebastian Jaszczur

arXiv:2401.04081 · cs.LG · February 27, 2024 · 23 cites

TL;DR

MoE-Mamba combines the Mamba State Space Model with Mixture of Experts. It reaches the same performance as Mamba in 2.35x fewer training steps, outperforms both Mamba and a baseline Transformer-MoE, and preserves Mamba's inference performance gains over the Transformer.

Contribution

This paper introduces MoE-Mamba, a novel integration of SSMs and MoE that enhances scaling and efficiency in sequential modeling tasks.

Findings

01

MoE-Mamba outperforms both Mamba and a baseline Transformer-MoE.

02

MoE-Mamba reaches the same performance as Mamba in 2.35x fewer training steps.

03

MoE-Mamba preserves Mamba's inference performance gains over the Transformer.
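
As a worked illustration of the "2.35x fewer training steps" finding, the short sketch below converts the factor into a step count and a fractional saving. The baseline of 100,000 steps is a hypothetical number chosen for the arithmetic, not a figure from the paper, and the function names are illustrative only.

```python
def steps_to_match(baseline_steps: float, factor: float = 2.35) -> float:
    """Steps needed to match a baseline run, given an 'N x fewer steps' factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return baseline_steps / factor


def fraction_saved(factor: float = 2.35) -> float:
    """Fraction of training steps saved relative to the baseline."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 - 1.0 / factor


# Hypothetical baseline: a Mamba run of 100,000 steps (not from the paper).
hypothetical_mamba_steps = 100_000
print(round(steps_to_match(hypothetical_mamba_steps)))  # about 42,553 steps
print(round(fraction_saved() * 100, 1))                 # about 57.4 percent fewer steps
```

Under this reading, "2.35x fewer" means dividing the step count by 2.35, which is a reduction of roughly 57%, not 235%.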

Abstract

State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35 \times$ fewer training steps while preserving the inference performance gains of Mamba against Transformer.
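
The abstract does not spell out how the MoE component is wired, so the following is only a minimal sketch of the general idea behind a Mixture of Experts layer, assuming top-1 routing in which each token is sent to a single expert and the expert output is scaled by the router probability. The router weights, the toy experts, and all function names are hypothetical; this is not the authors' implementation and it contains no Mamba block.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token: Vector, router_weights: Sequence[Vector]) -> Tuple[int, float]:
    """Return (expert index, gate probability) for one token.

    Assumption: the router is a single linear map, one weight row per expert.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def moe_layer(
    tokens: Sequence[Vector],
    router_weights: Sequence[Vector],
    experts: Sequence[Callable[[Vector], Vector]],
) -> List[Vector]:
    """Apply exactly one expert per token and scale its output by the gate."""
    outputs = []
    for token in tokens:
        idx, gate = route_top1(token, router_weights)
        expert_out = experts[idx](token)
        outputs.append([gate * v for v in expert_out])
    return outputs


# Toy setup (hypothetical): two experts, two-dimensional tokens.
toy_experts = [
    lambda v: [2.0 * x for x in v],  # expert 0 doubles its input
    lambda v: [-x for x in v],       # expert 1 negates its input
]
toy_router = [[1.0, 0.0], [0.0, 1.0]]
toy_tokens = [[2.0, 0.0], [0.0, 3.0]]

for tok in toy_tokens:
    print(tok, "->", route_top1(tok, toy_router))
print(moe_layer(toy_tokens, toy_router, toy_experts))
```

The point of the sketch is that only one expert runs per token, so the number of parameters can grow with the number of experts while the per-token compute stays roughly constant.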

Code & Models

Repositories

llm-random/llm-random
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)

Methods: Linear Layer · Dropout · Adam · Layer Normalization · Residual Connection · Absolute Position Encodings · Dense Connections · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax