DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang

TL;DR
DeepSeek LLM is an open-source language model family trained on a 2 trillion token dataset that is still expanding. The 67B base model outperforms LLaMA-2 70B, particularly in code, math, and reasoning, and the 67B chat model surpasses GPT-3.5 in open-ended evaluations.
Contribution
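The paper studies scaling laws for two commonly used open-source configurations, 7B and 67B, and uses them to guide DeepSeek LLM, an open-source project with a long-term perspective. The base models are pre-trained on a 2 trillion token dataset, then refined with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) to produce the DeepSeek Chat models.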
Findings
DeepSeek LLM 67B outperforms LLaMA-2 70B on multiple benchmarks.
DeepSeek LLM 67B Chat surpasses GPT-3.5 in open-ended evaluations.
The pre-training dataset currently consists of 2 trillion tokens and is continuously expanding.
Abstract
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Attention Dropout · Residual Connection · Dense Connections · Linear Warmup With Cosine Annealing
