SpecMemo: Speculative Decoding is in Your Pocket
Selin Yildirim, Deming Chen

TL;DR
SpecMemo introduces a device-aware inference engine that enables efficient speculative decoding on memory-limited devices, significantly improving LLM inference speed and reducing memory usage for real-world applications.
Contribution
The paper presents a novel memory management approach for speculative decoding, allowing deployment on constrained devices and facilitating large model inference across multiple GPUs.
Findings
Retains 96% of the throughput of speculative decoding while using 65% less memory on a single GPU.
Demonstrates 2x speedup with multiple small GPUs.
Increases throughput by 8x with larger batch sizes.
Abstract
Recent advancements in speculative decoding have demonstrated considerable speedup across a wide array of large language model (LLM) tasks. Speculative decoding inherently relies on sacrificing extra memory allocations to generate several candidate tokens, whose acceptance rate drives the speedup. However, deploying speculative decoding on memory-constrained devices, such as mobile GPUs, remains a significant challenge in real-world scenarios. In this work, we present a device-aware inference engine named SpecMemo that can smartly control memory allocations at finer levels to enable multi-turn chatbots with speculative decoding on such memory-limited devices. Our methodology stems from theoretically modeling the memory footprint of speculative decoding to determine a lower bound on the required memory budget while retaining speedup. SpecMemo empirically acquires a careful balance…
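The trade-off the abstract describes, extra memory for candidate tokens against a speedup driven by the acceptance rate, can be illustrated with a rough back-of-the-envelope sketch. This is not SpecMemo's memory model, which the summary above does not spell out. It assumes a standard transformer key/value cache layout and the commonly used simplification that each candidate token is accepted independently with the same probability. The function names and the model shape in the usage lines are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to store keys and values for num_tokens tokens of one sequence.

    Assumes a plain transformer KV cache: one key and one value vector per
    layer and per KV head, stored at bytes_per_elem bytes per element.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def expected_tokens_per_step(acceptance_rate, num_candidates):
    """Expected tokens emitted per verification step.

    Assumes each candidate is accepted independently with probability
    acceptance_rate and that the verifier always contributes one token.
    """
    a, k = acceptance_rate, num_candidates
    if a >= 1.0:
        return float(k + 1)
    return (1.0 - a ** (k + 1)) / (1.0 - a)


# Illustrative shape only (hypothetical, not taken from the paper).
extra_bytes = kv_cache_bytes(num_tokens=4, num_layers=32, num_kv_heads=32, head_dim=128)
tokens_per_step = expected_tokens_per_step(acceptance_rate=0.5, num_candidates=4)
print(extra_bytes, tokens_per_step)
```

Under these assumptions, the cache cost of candidates grows linearly with the number of candidates, while the expected number of emitted tokens saturates as the acceptance rate falls, which is why allocations for rejected candidates are a natural target for savings.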
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications
Methods: Balanced Selection
