Taming the Memory Beast: Strategies for Reliable ML Training on Kubernetes
Jaideep Ray

TL;DR
This paper discusses effective memory management strategies in Kubernetes for reliable machine learning training, focusing on GPU memory, ephemeral storage, and best practices to prevent memory-related failures.
Contribution
It provides a comprehensive analysis of Kubernetes memory handling for ML workloads and offers practical guidelines to improve stability and scalability.
Findings
Identifies common memory management pitfalls in Kubernetes ML training
Provides best practices for stable memory utilization
Highlights the importance of managing GPU memory and ephemeral storage
Abstract
Kubernetes offers a powerful orchestration platform for machine learning training, but memory management can be challenging due to specialized needs and resource constraints. This paper outlines how Kubernetes handles memory requests, limits, Quality of Service classes, and eviction policies for ML workloads, with special focus on GPU memory and ephemeral storage. Common pitfalls such as overcommitment, memory leaks, and ephemeral volume exhaustion are examined. We then provide best practices for stable, scalable memory utilization to help ML practitioners prevent out-of-memory events and ensure high-performance ML training pipelines.
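As a rough illustration of the Quality of Service classes mentioned in the abstract, the sketch below applies the standard Kubernetes classification rule to a pod's per-container CPU and memory settings. The function name `qos_class` and the dictionary layout are hypothetical, chosen only for this example. Quantities are compared as plain strings, a simplification (Kubernetes itself would treat "16Gi" and "16384Mi" as equal). GPU resources are left out because the QoS class is derived from CPU and memory only.

```python
from typing import Dict, List

# Hypothetical container shape for this sketch:
# {"requests": {"cpu": "4", "memory": "16Gi"}, "limits": {"cpu": "4", "memory": "16Gi"}}
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Return the QoS class a pod would receive: Guaranteed, Burstable, or BestEffort."""
    any_set = False      # does any container set a cpu/memory request or limit?
    guaranteed = True    # stays True only if every container has limit == request

    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req = requests.get(resource)
            lim = limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # When only a limit is given, Kubernetes defaults the request to the limit.
            effective_req = req if req is not None else lim
            # Simplification: string equality instead of parsed quantity equality.
            if lim is None or effective_req != lim:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    trainer = [{"requests": {"cpu": "4", "memory": "16Gi"},
                "limits": {"cpu": "4", "memory": "16Gi"}}]
    print(qos_class(trainer))
```

Under memory pressure the kubelet considers BestEffort and Burstable pods for eviction before Guaranteed ones, which is why setting memory requests equal to limits is a common choice for long-running training jobs.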
