DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang

TL;DR
DeepSeek-V2 is a Mixture-of-Experts (MoE) language model that combines strong performance with economical training and efficient inference. It supports a context length of 128K tokens and activates 21B of its 236B parameters for each token.
Contribution
The paper introduces DeepSeek-V2, built on two architectures: Multi-head Latent Attention (MLA), which compresses the Key-Value (KV) cache into a latent vector for efficient inference, and DeepSeekMoE, which uses sparse computation to train strong models at an economical cost.
Findings
Achieves significantly stronger performance than DeepSeek 67B.
Saves 42.5% of training costs relative to DeepSeek 67B.
Boosts maximum generation throughput to 5.76 times that of DeepSeek 67B.
Abstract
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and…
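The headline numbers in the abstract can be put in proportion with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the layer, head, and latent dimensions in the toy comparison are made-up values, not DeepSeek-V2's actual configuration. It assumes a standard cache stores one key and one value vector per head per layer, while a latent cache stores a single compressed vector per layer.

```python
# Illustrative arithmetic on figures quoted in the abstract.
# All function names and toy dimensions are assumptions for this sketch.

def activated_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of parameters used per token in a sparse MoE model."""
    return active_params_b / total_params_b


def remaining_fraction(reduction_pct: float) -> float:
    """Fraction left after a percentage reduction (e.g. 93.3 -> ~0.067)."""
    return 1.0 - reduction_pct / 100.0


def full_kv_elements_per_token(n_layers: int, n_heads: int, head_dim: int) -> int:
    """Elements cached per token when every head stores a key and a value."""
    return 2 * n_layers * n_heads * head_dim


def latent_kv_elements_per_token(n_layers: int, latent_dim: int) -> int:
    """Elements cached per token when each layer stores one latent vector."""
    return n_layers * latent_dim


if __name__ == "__main__":
    # From the abstract: 236B total parameters, 21B activated per token.
    print(f"activated fraction: {activated_fraction(236, 21):.1%}")

    # From the abstract: KV cache reduced by 93.3% versus DeepSeek 67B.
    left = remaining_fraction(93.3)
    print(f"KV cache remaining: {left:.1%} (about {1 / left:.1f}x smaller)")

    # Toy comparison with hypothetical dimensions (NOT the real model's).
    full = full_kv_elements_per_token(n_layers=4, n_heads=8, head_dim=64)
    latent = latent_kv_elements_per_token(n_layers=4, latent_dim=128)
    print(f"toy cache: {full} vs {latent} elements per token")
```

The toy comparison only shows why caching one latent vector per layer can be much smaller than caching per-head keys and values; it does not reproduce the 93.3% figure, which depends on the real model dimensions.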
Code & Models
- 🤗 deepseek-ai/DeepSeek-V2.5 · model · 6.1k dl · ♡ 733
- 🤗 jingyaogong/minimind-3 · model · 81 dl · ♡ 1
- 🤗 deepseek-ai/DeepSeek-V2 · model · 20k dl · ♡ 333
- 🤗 deepseek-ai/DeepSeek-V2-Chat · model · 12k dl · ♡ 461
- 🤗 deepseek-ai/DeepSeek-V2-Lite · model · 247k dl · ♡ 168
- 🤗 deepseek-ai/DeepSeek-V2-Lite-Chat · model · 639k dl · ♡ 135
- 🤗 Malrio/MyDeepSeek · model · 2 dl
- 🤗 ZZichen/DeepSeek-V2-Lite · model · 24 dl · ♡ 1
- 🤗 ZZichen/DeepSeek-V2-Lite-Chat · model · 8 dl
- 🤗 TechxGenus/DeepSeek-V2-Lite-Chat-AWQ · model · 126 dl · ♡ 2
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling · Speech and dialogue systems
