Taming the Memory Beast: Strategies for Reliable ML Training on Kubernetes

Jaideep Ray

arXiv:2412.14701 · cs.DC · December 30, 2024

Open Access

TL;DR

This paper discusses effective memory management strategies in Kubernetes for reliable machine learning training, focusing on GPU memory, ephemeral storage, and best practices to prevent memory-related failures.

Contribution

It provides a comprehensive analysis of Kubernetes memory handling for ML workloads and offers practical guidelines to improve stability and scalability.

Findings

01 Identifies common memory management pitfalls in ML training on Kubernetes

02 Provides best practices for stable memory utilization

03 Highlights the importance of managing GPU memory and ephemeral storage

Abstract

Kubernetes offers a powerful orchestration platform for machine learning training, but memory management can be challenging due to specialized needs and resource constraints. This paper outlines how Kubernetes handles memory requests, limits, Quality of Service classes, and eviction policies for ML workloads, with special focus on GPU memory and ephemeral storage. Common pitfalls such as overcommitment, memory leaks, and ephemeral volume exhaustion are examined. We then provide best practices for stable, scalable memory utilization to help ML practitioners prevent out-of-memory events and ensure high-performance ML training pipelines.
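As a rough illustration of how requests and limits map onto the Quality of Service classes mentioned in the abstract, the sketch below classifies a pod from its containers' CPU and memory settings. It follows the rules in the Kubernetes documentation, not code from the paper. The function name `qos_class` and the dictionary layout are assumptions made for this example, and quantities are compared as plain strings, so equivalent values written differently (for example `1Gi` and `1024Mi`) would be treated as unequal.

```python
from typing import Dict, List

# Hypothetical layout for one container's resources:
# {"requests": {"cpu": "1", "memory": "1Gi"}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Return "Guaranteed", "Burstable", or "BestEffort" for a pod."""
    any_set = False          # does any container set a cpu/memory value?
    all_guaranteed = True    # does every container meet the Guaranteed rule?

    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req = requests.get(resource)
            lim = limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit for both resources; a missing
            # request defaults to the limit, an explicit one must equal it.
            if lim is None or (req is not None and req != lim):
                all_guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


trainer = {
    "requests": {"cpu": "8", "memory": "64Gi"},
    "limits": {"cpu": "8", "memory": "64Gi"},
}
print(qos_class([trainer]))
```

Under node memory pressure, BestEffort pods are generally evicted first and Guaranteed pods last, which is why setting requests equal to limits is a common choice for long-running training jobs.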

