Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

arXiv:2401.04088 · cs.LG · January 9, 2024 · 120 citations

TL;DR

Mixtral 8x7B is a sparse mixture-of-experts language model in which a router selects two of eight experts per layer for each token, so each token has access to 47B parameters but uses only 13B active parameters during inference. It outperforms or matches Llama 2 70B and GPT-3.5 across the evaluated benchmarks.

Contribution

Introduces Mixtral 8x7B, a sparse mixture-of-experts model in which a per-layer router selects two of eight feedforward experts for each token, so that only 13B of its 47B parameters are active during inference.

Findings

01. Outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks.

02. Vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks.

03. The instruction-tuned model, Mixtral 8x7B - Instruct, surpasses GPT-3.5 Turbo and Claude-2.1 on human benchmarks.

Abstract

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that…
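
The routing step described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `make_expert`, `top2_moe_layer`) are hypothetical, the experts are toy scaling functions standing in for real feedforward blocks, and weighting the two outputs by a softmax over the selected router logits is an assumption, since the abstract only says the outputs are combined.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Toy stand-in for a feedforward block: multiplies the input by `scale`."""
    return lambda v: [scale * t for t in v]


def top2_moe_layer(x, router_weights, experts):
    """Route one token vector `x` through the two highest-scoring experts.

    router_weights holds one gate vector per expert; experts is a list of
    callables mapping a vector to a vector of the same length.
    """
    # One router logit per expert: dot product of the token with its gate vector.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    # Indices of the two experts with the largest logits.
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Assumption: gate weights are a softmax over the two selected logits only.
    gates = softmax([logits[i] for i in top2])
    # Weighted sum of the selected experts' outputs; the other six are never run.
    out = [0.0] * len(x)
    for g, i in zip(gates, top2):
        y = experts[i](x)
        out = [o + g * y_j for o, y_j in zip(out, y)]
    return out, top2, gates


random.seed(0)
dim, n_experts = 4, 8
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [make_expert(k + 1) for k in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]

out, chosen, gates = top2_moe_layer(token, router_weights, experts)
print("experts used:", chosen, "gate weights:", gates)
```

Because only two of the eight experts are evaluated per token, the per-token compute depends on the active parameters rather than on the total parameter count, which is the distinction the abstract draws between 47B and 13B.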

Code & Models

Datasets

toloka/beemo · dataset · 303 downloads

Videos

Mixtral of Experts (Paper Explained)· youtube

Gemini 1.5 and The Biggest Night in AI· youtube

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Machine Learning and Algorithms

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Attention Dropout · Dropout · Adam · Layer Normalization