DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek-AI: Xiao Bi; Deli Chen; Guanting Chen; Shanhuang Chen; Damai Dai; Chengqi Deng; Honghui Ding; Kai Dong; Qiushi Du; Zhe Fu; Huazuo Gao; Kaige Gao; Wenjun Gao; Ruiqi Ge; Kang Guan; Daya Guo; Jianzhong Guo; Guangbo Hao; Zhewen Hao; Ying He; Wenjie Hu; Panpan Huang; Erhang Li; Guowei Li; Jiashi Li; Yao Li; Y.K. Li; Wenfeng Liang; Fangyun Lin; A.X. Liu; Bo Liu; Wen Liu; Xiaodong Liu; Xin Liu; Yiyuan Liu; Haoyu Lu; Shanghao Lu; Fuli Luo; Shirong Ma; Xiaotao Nie; Tian Pei; Yishi Piao; Junjie Qiu; Hui Qu; Tongzheng Ren; Zehui Ren; Chong Ruan; Zhangli Sha; Zhihong Shao; Junxiao Song; Xuecheng Su; Jingxiang Sun; Yaofeng Sun; Minghui Tang; Bingxuan Wang; Peiyi Wang; Shiyu Wang; Yaohui Wang; Yongji Wang; Tong Wu; Y. Wu; Xin Xie; Zhenda Xie; Ziwei Xie; Yiliang Xiong; Hanwei Xu; R.X. Xu; Yanhong Xu; Dejian Yang; Yuxiang You; Shuiping Yu; Xingkai Yu; B. Zhang; Haowei Zhang; Lecong Zhang; Liyue Zhang; Mingchuan Zhang; Minghua Zhang; Wentao Zhang; Yichao Zhang; Chenggang Zhao; Yao Zhao; Shangyan Zhou; Shunfeng Zhou; Qihao Zhu; Yuheng Zou

arXiv:2401.02954 · cs.CL · January 8, 2024 · 87 cites

Open Access 1 Repo 4 Models

TL;DR

DeepSeek LLM is an open-source language model family trained on a large, still-expanding dataset. DeepSeek LLM 67B outperforms LLaMA-2 70B, particularly in code, math, and reasoning, and DeepSeek LLM 67B Chat surpasses GPT-3.5 in open-ended evaluations.

Contribution

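The paper presents DeepSeek LLM, an open-source model project with a long-term development perspective. Its design is guided by a study of scaling laws, its Base models are pre-trained on a 2-trillion-token dataset, and its Chat models are produced with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO).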

Findings

01

DeepSeek LLM 67B outperforms LLaMA-2 70B on multiple benchmarks.

02

DeepSeek LLM 67B Chat surpasses GPT-3.5 in open-ended evaluations.

03

A 2-trillion-token pre-training dataset, which is still expanding, supports training and scaling.

Abstract

The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate…
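The abstract names Direct Preference Optimization (DPO) without describing it. As a rough illustration only, the sketch below computes the standard per-example DPO loss from the general DPO literature, not the DeepSeek training code. The function name `dpo_loss`, the default `beta=0.1`, and the log-probability values in the usage lines are all hypothetical choices made for this example; none of them come from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch; beta is an assumed value).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, chosen only to exercise the function.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)   # policy prefers the chosen reply
print(baseline, improved)
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.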

Code & Models

Repositories

deepseek-ai/deepseek-llm
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Attention Dropout · Residual Connection · Dense Connections · Linear Warmup With Cosine Annealing