Lever: Speculative LLM Inference on Smartphones
Tuowei Wang, Fengzu Li, Yanfan Sun, Wei Gao, Ju Ren

TL;DR
Lever speeds up flash-backed large language model inference on smartphones by jointly optimizing the drafting, verification, and execution stages of speculative decoding.
Contribution
The paper introduces Lever, an end-to-end system that jointly optimizes the three stages of speculative decoding under mobile constraints: prolonged I/O latency, limited computation parallelism, and irregular speculation execution.
Findings
Reduces inference latency by 2.93x relative to baseline flash-offloaded inference.
Reduces inference latency by 1.50x relative to conventional speculative decoding.
Narrows the latency gap between flash-backed and memory-resident LLM inference.
Abstract
Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints.…
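The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not Lever's implementation: it shows plain greedy speculative decoding, and every name in it (`speculative_decode`, `toy_draft`, `toy_target`, the parameter `k`) is hypothetical. The two "models" are toy functions standing in for a DRAM-resident draft model and a flash-resident target model, and the per-position target calls inside one verification step stand in for what would be a single batched target invocation.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # hypothetical: greedy next-token function


def autoregressive_decode(target: Model, prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: one target invocation per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    draft: Model, target: Model, prompt: List[Token], num_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of target invocations)."""
    tokens = list(prompt)
    target_calls = 0
    while len(tokens) - len(prompt) < num_new:
        # Drafting: the small model proposes k candidate tokens.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Verification: counted as ONE target invocation that scores all k+1
        # positions (a real system would do this in a single forward pass).
        target_calls += 1
        expected = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # Accept the longest prefix on which draft and target agree.
        n = 0
        while n < k and proposal[n] == expected[n]:
            n += 1
        tokens.extend(proposal[:n])
        # The target's own token at the first mismatch (or a bonus token if
        # everything matched) is always kept, so each round makes progress.
        tokens.append(expected[n])
    return tokens[: len(prompt) + num_new], target_calls


def toy_target(ctx: List[Token]) -> Token:
    # Toy stand-in for the large model: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Toy stand-in for the small model: agrees with the target except after a 2.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    out, calls = speculative_decode(toy_draft, toy_target, [0], num_new=12, k=4)
    print(out, calls)
```

With greedy acceptance the output is identical to decoding with the target alone, but the target is invoked fewer times than the number of generated tokens. When each target invocation requires reading weights from flash, that invocation count is the cost this setting cares about.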
