Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

Reza Abbasi; Sernam Lim

arXiv:2410.08791·cs.LG·October 14, 2024

Open Access

TL;DR

Superpipeline is a flexible framework that reduces GPU memory usage in large models by dynamically managing layer transfers between GPU and CPU, enabling efficient training and inference on limited hardware without retraining.

Contribution

The paper introduces a novel, hardware-agnostic method for reducing GPU memory consumption in large models. The method applies across model types and requires no retraining.

Findings

1. Reduces GPU memory usage by up to 60%.

2. Maintains model accuracy and acceptable processing speeds.

3. Applicable to LLMs, VLMs, and vision models.

Abstract

The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach involves dynamically managing model execution by dividing models into individual layers and efficiently transferring these layers between GPU and CPU memory. Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds. This allows models that would otherwise exceed available GPU memory to run effectively. Unlike existing solutions that focus mainly on inference or specific model types, Superpipeline can be applied to large language models (LLMs),…
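The layer-by-layer transfer idea in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Layer` and `WindowedExecutor` names, the `window` parameter (how many layers are resident on the GPU at once), and the layer sizes are all hypothetical, and devices are modeled as plain strings so the example runs without a GPU or any deep learning library.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    size_mb: int
    device: str = "cpu"  # assumption: every layer starts in host memory


@dataclass
class WindowedExecutor:
    """Toy model of layer-wise offloading: at most `window` layers are on the GPU."""
    layers: list
    window: int           # hypothetical knob: number of layers resident at once
    peak_gpu_mb: int = 0  # highest simulated GPU residency seen so far
    transfers: int = 0    # number of simulated CPU<->GPU moves

    def _gpu_mb(self):
        return sum(layer.size_mb for layer in self.layers if layer.device == "gpu")

    def _move(self, layer, device):
        if layer.device != device:
            layer.device = device
            self.transfers += 1

    def forward(self):
        if self.window < 1:
            raise ValueError("window must be at least 1")
        executed = []
        # Preload the first `window` layers before execution starts.
        for layer in self.layers[: self.window]:
            self._move(layer, "gpu")
        self.peak_gpu_mb = max(self.peak_gpu_mb, self._gpu_mb())
        for i, layer in enumerate(self.layers):
            assert layer.device == "gpu", "layer must be resident before it runs"
            executed.append(layer.name)  # stand-in for the real layer computation
            self._move(layer, "cpu")     # evict the finished layer
            upcoming = i + self.window
            if upcoming < len(self.layers):
                self._move(self.layers[upcoming], "gpu")  # bring in the next layer
            self.peak_gpu_mb = max(self.peak_gpu_mb, self._gpu_mb())
        return executed


if __name__ == "__main__":
    model = [Layer(f"layer{i}", size_mb=100) for i in range(8)]
    executor = WindowedExecutor(model, window=2)
    executor.forward()
    full_mb = sum(layer.size_mb for layer in model)
    print(f"peak resident: {executor.peak_gpu_mb} MB of {full_mb} MB total")
```

In this toy setup, eight 100 MB layers with a window of two never occupy more than 200 MB of simulated GPU memory, compared with 800 MB if the whole model were resident. These figures are illustrative only and are not the paper's measurements. The sketch also ignores activations, optimizer state, and transfer latency; a real system would additionally need to overlap transfers with computation to keep processing speed acceptable.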

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Focus