Splitwiser: Efficient LM inference with constrained resources
Asad Aali, Adney Cardoza, Melissa Capo

TL;DR
Splitwiser runs both phases of an LLM inference request, prompt computation and token generation, on the same GPU, aiming to cut cross-device transfer overhead and improve resource utilization. Preliminary results cover Huggingface and vLLM.
Contribution
The paper introduces Splitwiser, a method that places both phases of an LLM inference request on a single GPU to improve efficiency and avoid the overhead of transferring data across devices.
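As a rough illustration of what a two-phase, multiprocess layout can look like, the sketch below runs a prompt worker and a token-generation worker as separate processes connected by queues. It is not the authors' implementation: `fake_prefill`, `fake_decode_step`, and `run_pipeline` are hypothetical names, the "model" is toy integer arithmetic, no GPU is involved, and the queue handoff only stands in for however the two phases share state on the device. The `fork` start method is assumed, so it targets POSIX systems.

```python
import multiprocessing as mp


def fake_prefill(prompt_tokens):
    """Hypothetical stand-in for prompt computation: builds a toy cache and a first token."""
    kv_cache = list(prompt_tokens)
    first_token = sum(prompt_tokens) % 100
    return kv_cache, first_token


def fake_decode_step(kv_cache, last_token):
    """Hypothetical stand-in for one token-generation step over the toy cache."""
    next_token = (last_token + len(kv_cache)) % 100
    kv_cache.append(last_token)
    return next_token


def prefill_worker(requests, handoff):
    # Consume (request id, prompt) pairs until the None sentinel arrives.
    for req_id, prompt in iter(requests.get, None):
        kv_cache, token = fake_prefill(prompt)
        handoff.put((req_id, kv_cache, token))
    handoff.put(None)  # tell the decode worker there is no more work


def decode_worker(handoff, results, max_new_tokens):
    # Generate tokens one at a time for each request handed over by prefill.
    for req_id, kv_cache, token in iter(handoff.get, None):
        generated = [token]
        for _ in range(max_new_tokens - 1):
            token = fake_decode_step(kv_cache, token)
            generated.append(token)
        results.put((req_id, generated))
    results.put(None)


def run_pipeline(prompts, max_new_tokens=4):
    """Run both phase workers concurrently and return {request id: generated tokens}."""
    ctx = mp.get_context("fork")  # assumption: POSIX host with fork available
    requests, handoff, results = ctx.Queue(), ctx.Queue(), ctx.Queue()
    workers = [
        ctx.Process(target=prefill_worker, args=(requests, handoff)),
        ctx.Process(target=decode_worker, args=(handoff, results, max_new_tokens)),
    ]
    for worker in workers:
        worker.start()
    for item in enumerate(prompts):
        requests.put(item)
    requests.put(None)
    outputs = dict(iter(results.get, None))  # drain results before joining
    for worker in workers:
        worker.join()
    return outputs


if __name__ == "__main__":
    print(run_pipeline([[1, 2, 3], [10, 20]], max_new_tokens=3))
```

Because the two workers are separate processes, the prompt worker can start on the next request while the generation worker is still producing tokens for the previous one, which is the overlap a same-device split is meant to exploit.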
Findings
Improved memory and cache utilization during inference.
Reduced data transfer overheads across devices.
Preliminary results show performance gains on Huggingface and vLLM architectures.
Abstract
Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface…
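The contrast between a compute-intensive prompt phase and a memory-intensive generation phase can be made concrete with a back-of-the-envelope arithmetic intensity estimate (FLOPs per byte moved). The sketch below is an assumption-laden simplification, not a figure from the paper: it models a single square linear layer with 16-bit values, counts the weights as read once per pass, and ignores attention and KV-cache traffic. The function name and the sizes 512 and 4096 are illustrative choices.

```python
def arithmetic_intensity(num_tokens, d_model, bytes_per_value=2):
    """Estimate FLOPs per byte for one d_model x d_model linear layer.

    Simplifying assumptions: weights are read once per forward pass,
    activations are read once and written once, values are 16-bit.
    """
    flops = 2 * num_tokens * d_model * d_model  # one multiply-add per weight per token
    weight_bytes = bytes_per_value * d_model * d_model
    activation_bytes = bytes_per_value * 2 * num_tokens * d_model
    return flops / (weight_bytes + activation_bytes)


# Prompt phase: many tokens share a single read of the weights.
prefill = arithmetic_intensity(num_tokens=512, d_model=4096)
# Generation phase: one new token per pass, so the weight read dominates.
decode = arithmetic_intensity(num_tokens=1, d_model=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

Under these assumptions the prompt pass performs a few hundred FLOPs per byte while a single-token generation pass performs about one, which is why generation tends to leave compute units idle while it waits on memory.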
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
