Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models
Reza Abbasi, Sernam Lim

TL;DR
Superpipeline is a flexible framework that reduces GPU memory usage in large models by dynamically managing layer transfers between GPU and CPU, enabling efficient training and inference on limited hardware without retraining.
Contribution
The paper introduces a hardware-agnostic method for reducing GPU memory consumption in large models, applicable across model types without retraining.
Findings
Reduces GPU memory usage by up to 60% in the authors' experiments.
Maintains model accuracy and acceptable processing speeds.
Applicable to LLMs, VLMs, and vision models.
Abstract
The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach involves dynamically managing model execution by dividing models into individual layers and efficiently transferring these layers between GPU and CPU memory. Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds. This allows models that would otherwise exceed available GPU memory to run effectively. Unlike existing solutions that focus mainly on inference or specific model types, Superpipeline can be applied to large language models (LLMs),…
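The layer-transfer idea described in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the authors' implementation: it moves no real tensors, and the `Layer` class, the `run_with_offloading` function, the fixed-size `window` policy, and the layer sizes are all hypothetical choices made for illustration. The abstract does not specify the scheduling policy. The sketch only tracks which layers would be resident on the GPU and reports the peak memory their weights would occupy.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for one model layer (no real weights)."""
    name: str
    size_mb: int
    device: str = "cpu"  # every layer starts in host memory


def run_with_offloading(layers, window):
    """Visit layers in order, keeping at most `window` of them on the simulated GPU.

    Returns the peak simulated GPU memory (in MB) occupied by layer weights.
    The fixed-size window is an assumption, not the paper's stated policy.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    resident = []  # layers currently on the simulated GPU, oldest first
    peak_mb = 0
    for layer in layers:
        # Evict the oldest resident layers until there is room for the next one.
        while len(resident) >= window:
            evicted = resident.pop(0)
            evicted.device = "cpu"
        layer.device = "gpu"
        resident.append(layer)
        peak_mb = max(peak_mb, sum(l.size_mb for l in resident))
        # A real implementation would run the layer's forward pass here.
    # Return the remaining layers to host memory once the pass is finished.
    for leftover in resident:
        leftover.device = "cpu"
    return peak_mb


# Made-up configuration: ten equally sized layers, three resident at a time.
layers = [Layer(f"block_{i}", size_mb=100) for i in range(10)]
full_mb = sum(l.size_mb for l in layers)
peak_mb = run_with_offloading(layers, window=3)
saving = 1 - peak_mb / full_mb
print(f"all layers resident: {full_mb} MB, windowed peak: {peak_mb} MB")
```

The numbers here are invented and say nothing about the paper's measured results. The sketch also ignores activations, optimizer state, and transfer latency, all of which affect real memory use and speed.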
Taxonomy
Topics: Parallel Computing and Optimization Techniques
Methods: Focus
