SpecMemo: Speculative Decoding is in Your Pocket

Selin Yildirim; Deming Chen

arXiv:2506.01986·cs.LG·June 4, 2025

TL;DR

SpecMemo introduces a device-aware inference engine that enables efficient speculative decoding on memory-limited devices, significantly improving LLM inference speed and reducing memory usage for real-world applications.

Contribution

The paper presents a novel memory management approach for speculative decoding, allowing deployment on constrained devices and facilitating large model inference across multiple GPUs.

Findings

01. Retains 96% of throughput while using 65% less memory on a single GPU.

02. Demonstrates a 2x speedup with multiple small GPUs.

03. Increases throughput by 8x with larger batch sizes.

Abstract

Recent advancements in speculative decoding have demonstrated considerable speedup across a wide array of large language model (LLM) tasks. Speculative decoding inherently relies on sacrificing extra memory allocations to generate several candidate tokens, whose acceptance rate drives the speedup. However, deploying speculative decoding on memory-constrained devices, such as mobile GPUs, remains a significant challenge in real-world scenarios. In this work, we present a device-aware inference engine named SpecMemo that controls memory allocations at a finer granularity to enable multi-turn chatbots with speculative decoding on such memory-limited devices. Our methodology stems from theoretically modeling the memory footprint of speculative decoding to determine a lower bound on the required memory budget while retaining speedup. SpecMemo empirically acquires a careful balance…
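The trade-off the abstract describes, extra memory for candidate tokens against a speedup driven by the acceptance rate, can be illustrated with a short back-of-the-envelope sketch. This is not SpecMemo's memory model, which is not given here. It assumes the common simplification that each drafted token is accepted independently with probability `alpha`, a single linear chain of `gamma` drafted tokens, and a KV cache stored in 16-bit precision. The function names and the model shape in the example are hypothetical and chosen only for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumption: each of the `gamma` drafted tokens is accepted
    independently with probability `alpha`, and the target model always
    contributes one token of its own (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache bytes needed for one token (keys plus values).

    Assumption: a standard transformer cache, 2 bytes per value (fp16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def draft_overhead_bytes(gamma: int, per_token_bytes: int) -> int:
    """Extra cache held for `gamma` unverified candidate tokens.

    Assumption: one linear chain of candidates; tree-shaped drafts
    would hold more entries than this.
    """
    return gamma * per_token_bytes


if __name__ == "__main__":
    # Illustrative shape only: 32 layers, 32 KV heads, head dimension 128.
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
    for gamma in (1, 2, 4, 8):
        tokens = expected_tokens_per_step(alpha=0.8, gamma=gamma)
        extra_mib = draft_overhead_bytes(gamma, per_token) / 2**20
        print(f"gamma={gamma}: {tokens:.2f} tokens/step, {extra_mib:.1f} MiB extra cache")
```

Under these assumptions the expected tokens per step saturate as `gamma` grows while the cache overhead grows linearly, which is why rejected candidates amount to wasted memory on a constrained device.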

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

Methods: Balanced Selection