Splitwiser: Efficient LM inference with constrained resources

Asad Aali; Adney Cardoza; Melissa Capo

arXiv:2505.03763·cs.AR·May 8, 2025

TL;DR

Splitwiser runs both phases of large language model inference, prompt computation and token generation, on the same GPU instead of splitting them across devices. The goal is to reduce data-transfer overhead and improve resource utilization; preliminary results are reported for Huggingface and vLLM.

Contribution

The paper introduces Splitwiser, a method that runs the prompt-computation and token-generation phases of LLM inference on a single GPU to improve efficiency and avoid cross-device data transfers.

Findings

01  Improved memory and cache utilization during inference.

02  Reduced data transfer overheads across devices.

03  Preliminary results show performance gains on Huggingface and vLLM architectures.

Abstract

Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface…
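
The two-phase structure described in the abstract can be illustrated with a small producer/consumer sketch. This is not the authors' implementation: `prefill`, `decode_step`, and `run_pipeline` are hypothetical names, the "model" is a toy arithmetic rule, and Python threads stand in for the separate processes that would share one GPU in the paper's multiprocessing design. It only shows the handoff of a per-request cache from the prompt phase to the token phase without leaving the device (here, without leaving the process).

```python
import queue
import threading


def prefill(prompt_tokens):
    """Prompt phase (toy): process the whole prompt at once and build a 'KV cache'."""
    return list(prompt_tokens)  # one cache entry per prompt token


def decode_step(kv_cache):
    """Token phase (toy): emit one token from the cache, then extend the cache."""
    next_token = sum(kv_cache) % 100  # placeholder rule, not a real model
    kv_cache.append(next_token)
    return next_token


def prefill_worker(requests, handoff):
    # Consume prompts, run the prompt phase, and pass the cache downstream.
    while True:
        item = requests.get()
        if item is None:  # sentinel: no more requests
            handoff.put(None)
            return
        req_id, prompt, max_new_tokens = item
        handoff.put((req_id, prefill(prompt), max_new_tokens))


def decode_worker(handoff, results):
    # Generate tokens one at a time for each request handed over by prefill.
    while True:
        item = handoff.get()
        if item is None:
            return
        req_id, kv_cache, max_new_tokens = item
        results[req_id] = [decode_step(kv_cache) for _ in range(max_new_tokens)]


def run_pipeline(batch):
    """Run both phases concurrently; batch holds (req_id, prompt, max_new_tokens)."""
    requests, handoff, results = queue.Queue(), queue.Queue(), {}
    workers = [
        threading.Thread(target=prefill_worker, args=(requests, handoff)),
        threading.Thread(target=decode_worker, args=(handoff, results)),
    ]
    for worker in workers:
        worker.start()
    for item in batch:
        requests.put(item)
    requests.put(None)
    for worker in workers:
        worker.join()
    return results
```

Because the two workers run concurrently, the prompt phase of one request can overlap with the token phase of an earlier one, which is the kind of overlap a same-GPU split is meant to exploit. Calling `run_pipeline([(0, [1, 2, 3], 3)])` generates three toy tokens for request 0.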

Code & Models

Repositories

asad-aali/splitwiser (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies