SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices
Xiangwen Zhuge, Xu Shen, Zeyu Wang, Fan Dang, Xuan Ding, Danyang Li, Yahui Han, Tianxiang Hao, Zheng Yang

TL;DR
SpecOffload introduces a novel inference engine that leverages speculative decoding to utilize latent GPU capacity, significantly improving throughput and core utilization for LLM inference on resource-limited devices.
Contribution
It proposes SpecOffload, a new system that embeds speculative decoding into offloading to unlock GPU resources and enhance inference efficiency on constrained devices.
Findings
GPU core utilization increased by 4.49x
Inference throughput improved by 2.54x
Achieved near-zero additional cost for acceleration
Abstract
Efficient LLM inference on resource-constrained devices presents significant challenges in compute and memory utilization. Due to limited GPU memory, existing systems offload model weights to CPU memory, incurring substantial I/O overhead between the CPU and GPU. This leads to two major inefficiencies: (1) GPU cores are underutilized, often remaining idle while waiting for data to be loaded; and (2) GPU memory has low impact on performance, as reducing its capacity has minimal effect on overall throughput. In this paper, we propose SpecOffload, a high-throughput inference engine that embeds speculative decoding into offloading. Our key idea is to unlock latent GPU resources for storing and executing a draft model used for speculative decoding, thus accelerating inference at near-zero additional cost. To support this, we carefully orchestrate the interleaved execution of target and draft…
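For readers unfamiliar with speculative decoding, the idea the abstract builds on can be illustrated with a minimal sketch. This is not SpecOffload's implementation: it shows only the generic greedy variant of speculative decoding, with no offloading, batching, or scheduling. The names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy functions over integer tokens.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one greedy speculative decoding step; return the newly accepted tokens."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2) The target model checks every proposed position. A real engine would do
    #    this in one batched forward pass; it is unrolled here for clarity.
    accepted: List[Token] = []
    for i in range(k):
        expected = target(prefix + proposed[:i])
        if proposed[i] != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(proposed[i])

    # 3) All k proposals matched, so the target's pass yields one extra token.
    accepted.append(target(prefix + proposed))
    return accepted


def toy_target(tokens: List[Token]) -> Token:
    # Hypothetical target model: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[Token]) -> Token:
    # Hypothetical draft model: agrees with the target except after token 3.
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, k=4))
    print(speculative_step([5], toy_draft, toy_target, k=3))
```

In this greedy variant the accepted tokens are exactly what the target model would have produced on its own, so the draft model affects only how many tokens each target pass yields, never the output. When the target's weights must be streamed from CPU memory, each target pass is expensive, which is why obtaining several tokens per pass is attractive in the offloading setting.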
Peer Reviews
Decision · Submitted to ICLR 2026
1. This paper focuses on the important topic of efficient LLM serving. 2. It proposes a clever method that fits draft model execution into the idle bubbles created by offloading the target model.
1. The planner component is neither clearly explained nor clearly necessary. 2. Some analyses and claims are not solid enough.
* Efficient LLM inference is important given its increasing adoption in many applications. As such, the paper is working in a good direction. * The paper aims to better utilize GPUs, which is important, and it appears to achieve some gains. * Combining offloading with speculative decoding is a clear and natural follow-up to prior work on each.
* The paper seems rather shallow in experimental depth (please refer to the questions for details). This significantly limits its generalizability.
1. The paper identifies an under-explored inefficiency (GPU idleness during offloading) and proposes to exploit this latent capacity through speculative decoding. 2. The dual-batch interleaved execution and adaptive tensor placement form a coherent system that effectively overlaps CPU compute, GPU compute, and I/O. Speculative decoding is repurposed not merely as a speed-up technique but as a way to hide I/O latency, which is novel in the offloading context. 3. The paper is well-written.
1. Speculative decoding itself and its integration into offloading are not unique; the novelty lies primarily in system integration and scheduling. 2. All experiments are single-GPU; there is no exploration of distributed or multi-GPU extensions.
Taxonomy
Topics: Scientific Computing and Data Management · Topic Modeling · Advanced Data Storage Technologies
