Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems

Ayush Gundawar; Euijun Chung; Hyesoon Kim

arXiv:2412.04569 · cs.AR · December 10, 2024

TL;DR

This paper presents MQMS, an in-storage GPU architecture and simulator that uses knowledge of internal SSD states and operations to guide scheduling and address allocation, improving performance for data-intensive machine learning workloads whose datasets exceed GPU DRAM capacity.

Contribution

MQMS introduces a performance-aware, in-storage GPU architecture with two mechanisms: dynamic address allocation, which maximizes internal SSD parallelism, and fine-grained address mapping, which handles small I/O requests without read-modify-write overhead.

Findings

01 Orders-of-magnitude improvements in I/O throughput.

02 Significant reductions in device response time.

03 Faster simulation end times for large workloads.

Abstract

The exponential growth of data-intensive machine learning workloads has exposed significant limitations in conventional GPU-accelerated systems, especially when processing datasets exceeding GPU DRAM capacity. We propose MQMS, an augmented in-storage GPU architecture and simulator that is aware of internal SSD states and operations, enabling intelligent scheduling and address allocation to overcome performance bottlenecks caused by CPU-mediated data access patterns. MQMS introduces dynamic address allocation to maximize internal parallelism and fine-grained address mapping to efficiently handle small I/O requests without incurring read-modify-write overheads. Through extensive evaluations on workloads ranging from large language model inference to classical machine learning algorithms, MQMS demonstrates orders-of-magnitude improvements in I/O request throughput, device response time,…
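The two mechanisms named in the abstract can be illustrated with a toy model. The sketch below is not the MQMS implementation: the sector and page sizes, the `Plane` class, and the functions `static_plane`, `dynamic_plane`, `simulate`, `needs_rmw`, and `flash_bytes_written` are all hypothetical names and values chosen for illustration. It assumes every request arrives at time zero and that each write occupies one plane for a fixed latency.

```python
from dataclasses import dataclass

SECTOR = 512        # bytes; assumed smallest host I/O unit (illustrative)
PAGE = 16 * 1024    # bytes; assumed flash page size (illustrative)


@dataclass
class Plane:
    busy_until: float = 0.0  # time at which this plane's queued work finishes


def static_plane(lpn: int, n_planes: int) -> int:
    """Static allocation: the plane is fixed by the logical page number."""
    return lpn % n_planes


def dynamic_plane(planes: list[Plane]) -> int:
    """Dynamic allocation: pick the plane that becomes free earliest."""
    return min(range(len(planes)), key=lambda i: planes[i].busy_until)


def simulate(lpns: list[int], n_planes: int, latency: float, dynamic: bool) -> float:
    """Return the time at which the last write completes."""
    planes = [Plane() for _ in range(n_planes)]
    for lpn in lpns:
        i = dynamic_plane(planes) if dynamic else static_plane(lpn, n_planes)
        planes[i].busy_until += latency
    return max(p.busy_until for p in planes)


def needs_rmw(write_size: int, map_unit: int) -> bool:
    """An aligned write that only partly fills a mapping unit must be merged
    with the unit's existing data (read-modify-write)."""
    return write_size % map_unit != 0


def flash_bytes_written(write_size: int, map_unit: int) -> int:
    """Bytes programmed to flash for one aligned host write."""
    units = -(-write_size // map_unit)  # ceiling division
    return units * map_unit


# A strided pattern in which every logical page maps to plane 0 under
# static allocation, so the writes serialize on a single plane.
strided = [0, 4, 8, 12, 16, 20, 24, 28]
print(simulate(strided, 4, 1.0, dynamic=False))  # 8.0
print(simulate(strided, 4, 1.0, dynamic=True))   # 2.0

# A 512-byte write under page-level versus sector-level mapping.
print(needs_rmw(512, PAGE), flash_bytes_written(512, PAGE))      # True 16384
print(needs_rmw(512, SECTOR), flash_bytes_written(512, SECTOR))  # False 512
```

In this toy setting, dynamic allocation spreads the strided writes across all four planes, and sector-level mapping lets the small write complete without touching the rest of the page. A real device would also have to account for the larger mapping table that finer granularity implies, which this sketch ignores.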

Code & Models

Repositories

ayushgun/mqms (Official)

Taxonomy

Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management