TL;DR
This paper introduces pipelined sharding, a novel CPU-GPU hybrid scheduling method that enables efficient, lossless inference of large language and vision-language models on client systems with limited VRAM.
Contribution
The paper presents a new model sharding technique, combined with system optimizations, for high-accuracy, VRAM-constrained inference of LLMs and VLMs (jointly, xLMs) on client hardware.
Findings
Time-to-first-token (TTFT) improved by up to 6.7x for LLMs
Tokens per second (TPS) increased by up to 30x for LLMs
VRAM demand for CR1 inference reduced by 10x
Abstract
To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. To address this, we present pipelined sharding, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llamacpp implementation of three well-understood prior ideas…
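The abstract names pipelined copy-compute as one ingredient. The sketch below is a toy timing model of that general idea, not the paper's scheduler: it assumes each shard must be copied to VRAM before it is computed, that one copy and one compute can run concurrently, and that a fixed number of staging buffers limits how far copies can run ahead. The function names, the `buffers` parameter, and the example timings are all hypothetical.

```python
from typing import Sequence


def serial_time(copy_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Total time when every shard is copied, then computed, with no overlap."""
    return sum(copy_ms) + sum(compute_ms)


def pipelined_time(
    copy_ms: Sequence[float],
    compute_ms: Sequence[float],
    buffers: int = 2,
) -> float:
    """Total time when the copy of shard i+1 overlaps the compute of shard i.

    Assumption (hypothetical model): `buffers` staging slots exist in VRAM,
    so the copy of shard i cannot start until shard i - buffers has finished
    computing and released its slot.
    """
    if len(copy_ms) != len(compute_ms):
        raise ValueError("copy_ms and compute_ms must have the same length")
    if buffers < 1:
        raise ValueError("at least one buffer is required")

    copy_done: list[float] = []
    compute_done: list[float] = []
    for i, (copy_t, compute_t) in enumerate(zip(copy_ms, compute_ms)):
        # The copy engine is busy until the previous copy finishes.
        prev_copy = copy_done[i - 1] if i >= 1 else 0.0
        # A staging slot frees up when shard i - buffers finishes computing.
        slot_free = compute_done[i - buffers] if i >= buffers else 0.0
        copy_end = max(prev_copy, slot_free) + copy_t

        # Compute needs both the shard's data and the compute engine.
        prev_compute = compute_done[i - 1] if i >= 1 else 0.0
        compute_end = max(copy_end, prev_compute) + compute_t

        copy_done.append(copy_end)
        compute_done.append(compute_end)

    return compute_done[-1] if compute_done else 0.0


if __name__ == "__main__":
    copies = [4.0, 4.0, 4.0]    # made-up per-shard copy times (ms)
    computes = [3.0, 3.0, 3.0]  # made-up per-shard compute times (ms)
    print("serial:", serial_time(copies, computes))
    print("pipelined, 2 buffers:", pipelined_time(copies, computes, buffers=2))
```

With a single buffer the model degenerates to the serial schedule, which is a quick sanity check that the overlap comes from having at least two staging slots.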
