Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier

TL;DR
Mixtral 8x7B is a sparse mixture-of-experts language model in which each token has access to 47 billion parameters but, through dynamic expert selection, uses only 13 billion of them. It matches or outperforms Llama 2 70B and GPT-3.5 across all evaluated benchmarks.
Contribution
Introduces Mixtral 8x7B, a sparse mixture-of-experts model in which a router selects two of eight experts per token at each layer, so that only 13B of the 47B parameters are active for any given token during inference.
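The gap between total and active parameters can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the hyperparameter values (model width, layer count, feedforward width, head counts, vocabulary size) are assumptions that do not appear in this summary, each expert is assumed to be a gated feedforward block with three weight matrices, and normalization weights are ignored.

```python
# Rough parameter count for a sparse mixture-of-experts transformer.
# All hyperparameters below are ASSUMED for illustration; they are not
# stated in this summary.
dim = 4096              # model width (assumed)
n_layers = 32           # transformer layers (assumed)
hidden_dim = 14336      # feedforward width of one expert (assumed)
n_heads = 32            # query heads (assumed)
n_kv_heads = 8          # key/value heads, grouped-query attention (assumed)
head_dim = 128          # per-head width (assumed)
vocab_size = 32000      # vocabulary size (assumed)
n_experts = 8           # experts per layer (from the abstract)
experts_per_token = 2   # experts selected per token (from the abstract)

# One expert: gated feedforward block with three dim x hidden_dim matrices (assumed).
expert_params = 3 * dim * hidden_dim
# Attention: query and output projections plus smaller key and value projections.
attn_params = 2 * dim * n_heads * head_dim + 2 * dim * n_kv_heads * head_dim
# Router: one linear map from the hidden state to one logit per expert.
router_params = dim * n_experts
# Input embedding plus an untied output projection (assumed).
embed_params = 2 * vocab_size * dim


def count_params(experts_used: int) -> int:
    """Parameters touched when `experts_used` experts run in every layer."""
    per_layer = attn_params + router_params + experts_used * expert_params
    return n_layers * per_layer + embed_params


total_params = count_params(n_experts)
active_params = count_params(experts_per_token)
print(f"total  ~ {total_params / 1e9:.1f}B")
print(f"active ~ {active_params / 1e9:.1f}B")
```

Under these assumptions the totals land close to the 47B and 13B figures quoted in the abstract. Attention and embedding weights are shared by every token, which is why the active count is well above one quarter of the total even though only two of eight experts run.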
Findings
Matches or outperforms Llama 2 70B and GPT-3.5 across all evaluated benchmarks.
Vastly outperforms Llama 2 70B in mathematics, code, and multilingual tasks.
The instruction-tuned version, Mixtral 8x7B - Instruct, surpasses GPT-3.5 Turbo and Claude-2.1 on human benchmarks.
Abstract
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that…
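The routing step described in the abstract (a router picks two of eight experts per token and combines their outputs) can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the expert here is a hypothetical toy ReLU block, the helper names (`route_token`, `make_toy_expert`, `matvec`) are invented for this example, and weighting the two outputs by a softmax over only the selected router logits is an assumption, since the abstract does not say how the outputs are combined.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def make_toy_expert(rng, dim, hidden):
    """Hypothetical stand-in expert: a small two-layer ReLU feedforward block."""
    w_in = [[rng.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 1) / math.sqrt(hidden) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, a) for a in matvec(w_in, x)]
        return matvec(w_out, h)

    return expert


def route_token(x, router_weights, experts, k=2):
    """Send one token's hidden state through its top-k experts and mix the outputs."""
    logits = matvec(router_weights, x)  # one router logit per expert
    # Indices of the k experts with the largest logits, best first.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: gate weights are a softmax over the selected logits only.
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, idx in zip(gates, chosen):
        y = experts[idx](x)  # only the chosen experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen, gates


rng = random.Random(0)
dim, hidden, n_experts = 8, 16, 8
experts = [make_toy_expert(rng, dim, hidden) for _ in range(n_experts)]
router_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

for position, x in enumerate(tokens):
    _, chosen, gates = route_token(x, router_weights, experts)
    print(position, chosen, [round(g, 3) for g in gates])
```

Each token evaluates only two of the eight experts, and the chosen pair can differ from one token to the next, which is how the model keeps the active parameter count low while drawing on the full parameter set.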
Code & Models
- 🤗 jingyaogong/minimind-3 · model · 81 dl · ♡ 1
- 🤗 HIT-SCIR/Chinese-Mixtral-8x7B-adapter · model · ♡ 1
- 🤗 HIT-SCIR/Chinese-Mixtral-8x7B · model · 8.2k dl · ♡ 45
- 🤗 tenyx/TenyxChat-8x7B-v1 · model · 704 dl · ♡ 12
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-2.4bpw-h6-exl2 · model
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-3.0bpw-h6-exl2 · model · 5 dl
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-3.5bpw-h6-exl2 · model · 1 dl
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-3.75bpw-h6-exl2 · model · 1 dl
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-5.0bpw-h6-exl2 · model
- 🤗 LoneStriker/Chinese-Mixtral-8x7B-6.0bpw-h6-exl2 · model · 3 dl
Videos
Mixtral of Experts (Paper Explained) · YouTube
Gemini 1.5 and The Biggest Night in AI · YouTube
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Machine Learning and Algorithms
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Attention Dropout · Dropout · Adam · Layer Normalization
