Memory Analysis on the Training Course of DeepSeek Models

Ping Zhang; Lei Su

arXiv:2502.07846 · cs.PF · February 13, 2025

TL;DR

This paper provides a theoretical analysis of GPU memory consumption during the training of large-scale DeepSeek models, focusing on factors like micro-batch size, parallelism, and optimization strategies.

Contribution

It offers a detailed understanding of device-level memory requirements and dynamics for training large-scale mixture-of-experts models under various configurations.

Findings

01. Analyzes impact of micro-batch size and parallelism on memory usage

02. Clarifies effects of activation recomputation and ZeRO optimizations

03. Provides insights into memory consumption patterns for large-scale models

Abstract

We present a theoretical analysis of GPU memory consumption during the training of DeepSeek models such as DeepSeek-v2 and DeepSeek-v3. Our primary objective is to clarify the device-level memory requirements associated with various distributed training configurations. Specifically, we examine critical factors influencing memory usage, including micro-batch size, activation recomputation policies, 3D parallelism, and ZeRO optimizations. It is important to emphasize that the training policies discussed in this report are not representative of DeepSeek's official configurations. Instead, they are explored to provide a deeper understanding of memory dynamics in the training of large-scale mixture-of-experts models.
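
To make the role of ZeRO in device-level memory concrete, here is a minimal sketch of the standard accounting for model states under mixed-precision Adam: 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer states (fp32 master weights, momentum, variance). This is the widely used ZeRO bookkeeping, not necessarily the exact formulas derived in the paper. The function name, the byte constants, and the example values are assumptions chosen for illustration; `num_params` is assumed to be the parameter count one device holds after tensor and pipeline partitioning, and activation memory is deliberately left out.

```python
# Assumed byte costs per parameter for mixed-precision training with Adam.
BYTES_PARAM = 2    # 16-bit weights
BYTES_GRAD = 2     # 16-bit gradients
BYTES_OPTIM = 12   # fp32 master weights + Adam momentum + Adam variance


def zero_model_state_bytes(num_params: int, dp_size: int, zero_stage: int) -> float:
    """Estimate per-device bytes for weights, gradients, and optimizer states.

    num_params: parameters held by one device before ZeRO sharding (assumed
                to already reflect tensor/pipeline partitioning).
    dp_size:    data-parallel group size over which ZeRO shards.
    zero_stage: 0 (no sharding), 1 (optimizer states), 2 (+ gradients),
                3 (+ parameters).
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")

    params = BYTES_PARAM * num_params
    grads = BYTES_GRAD * num_params
    optim = BYTES_OPTIM * num_params

    # Each stage shards one more category across the data-parallel group.
    if zero_stage >= 1:
        optim /= dp_size
    if zero_stage >= 2:
        grads /= dp_size
    if zero_stage >= 3:
        params /= dp_size

    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical example: 1B parameters per device, 8-way data parallelism.
    for stage in range(4):
        gib = zero_model_state_bytes(1_000_000_000, 8, stage) / 2**30
        print(f"ZeRO stage {stage}: {gib:.2f} GiB of model states per device")
```

The estimate covers model states only; activation memory, which depends on micro-batch size and the recomputation policy, would have to be added on top to approximate total device usage.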

Taxonomy

Topics: Scientific Computing and Data Management

Methods: ZeRO