Efficient, VRAM-Constrained xLM Inference on Clients

Aditya Ukarande; Deep Shekhar; Marc Blackstein; Ram Rangan

arXiv:2604.26334·cs.DC·April 30, 2026

TL;DR

This paper introduces pipelined sharding, a novel CPU-GPU hybrid scheduling method that enables efficient, lossless inference of large language and vision-language models on client systems with limited VRAM.

Contribution

It presents a new model sharding technique combined with system optimizations for high-accuracy, VRAM-constrained inference of xLMs, including vision-language models, on client hardware.

Findings

01  TTFT improved by up to 6.7x for LLMs

02  TPS increased by up to 30x for LLMs

03  VRAM demand for CR1 inference reduced by 10x

Abstract

To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas…
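The abstract names pipelined copy-compute as one ingredient of pipelined sharding. The following is a minimal sketch of why overlapping host-to-device copies with GPU compute can help. It is not the paper's scheduler. The function names, the per-shard timings, and the assumption that enough staging buffers exist for copies to run back-to-back are all hypothetical and chosen only for illustration.

```python
def serial_time(copy_ms, compute_ms):
    """Total time when each shard is copied, then computed, one after another."""
    return sum(copy_ms) + sum(compute_ms)


def pipelined_time(copy_ms, compute_ms):
    """Total time when the copy of the next shard overlaps the current compute.

    Assumption (hypothetical): the copy engine and the compute engine run
    independently, and staging buffers are plentiful, so copies are issued
    back-to-back and never wait for compute to free a buffer.
    """
    copy_done = 0.0     # time at which the most recent shard finished copying
    compute_done = 0.0  # time at which the most recent shard finished computing
    for c, g in zip(copy_ms, compute_ms):
        copy_done += c
        # A shard can start computing only after it has arrived in VRAM
        # and the previous shard's compute has finished.
        compute_done = max(compute_done, copy_done) + g
    return compute_done


# Illustrative numbers only: three shards, 4 ms to copy and 3 ms to compute each.
copy_ms = [4, 4, 4]
compute_ms = [3, 3, 3]
print(serial_time(copy_ms, compute_ms))     # 21
print(pipelined_time(copy_ms, compute_ms))  # 15.0
```

In this toy model the pipelined total approaches the larger of the total copy time and the total compute time, plus a fill or drain term for the first or last shard. A real scheduler would also have to bound the number of in-flight shards by the VRAM available for staging.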

Code & Models

Repositories

deepshnv/pipeshard-mlsys26-ae
github
