Lever: Speculative LLM Inference on Smartphones

Tuowei Wang; Fengzu Li; Yanfan Sun; Wei Gao; Ju Ren

arXiv:2605.16786·cs.LG·May 19, 2026

TL;DR

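Lever is a system for efficient large language model inference on smartphones. It keeps a small draft model in DRAM, verifies the draft's candidate tokens with a larger flash-resident target model, and optimizes the drafting, verification, and execution stages of speculative decoding to reduce latency.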

Contribution

The paper introduces Lever, an end-to-end system that jointly optimizes the three stages of speculative decoding under mobile constraints: prolonged I/O latency, limited computation parallelism, and irregular speculation execution.

Findings

01

Reduces inference latency by 2.93x over baseline flash-offloaded inference.

02

Reduces inference latency by 1.50x over conventional speculative decoding.

03

Narrows the latency gap between flash-backed and memory-resident LLM inference.

Abstract

Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow because autoregressive decoding repeatedly invokes the target model and incurs costly I/O. We observe that speculative decoding is a natural fit for this setting: a small draft model can remain in DRAM, while a larger flash-resident target model verifies multiple candidate tokens per invocation. However, existing methods assume server-class accelerators and fail to account for prolonged I/O latency, limited computation parallelism, and irregular speculation execution. We present Lever, an end-to-end system for efficient flash-backed LLM inference on smartphones. Lever jointly optimizes the three stages of speculative decoding under mobile constraints.…
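
The draft-then-verify loop described in the abstract can be illustrated with a toy sketch of generic greedy speculative decoding. This is not Lever's implementation: `draft_next`, `target_next`, `verify`, and the integer "tokens" are hypothetical stand-ins, and the sketch models none of the flash I/O or mobile scheduling the paper addresses. It only shows why one target invocation can yield several tokens while the output stays identical to target-only greedy decoding.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function (assumption)


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large, flash-resident target model."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the small DRAM-resident draft model.
    It agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def verify(target: NextToken, ctx: List[int], proposal: List[int]) -> List[int]:
    """Simulates ONE target invocation that scores all k+1 positions.
    A real model would do this in a single batched forward pass;
    the Python loop here is only for illustration."""
    return [target(ctx + proposal[:i]) for i in range(len(proposal) + 1)]


def speculative_decode(draft: NextToken, target: NextToken,
                       prompt: List[int], k: int,
                       max_new: int) -> Tuple[List[int], int]:
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < max_new:
        # Drafting: propose k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Verification: one (simulated) target invocation for all positions.
        preds = verify(target, out, proposal)
        target_calls += 1
        # Accept the longest prefix on which draft and target agree.
        n = 0
        while n < k and proposal[n] == preds[n]:
            n += 1
        out.extend(proposal[:n])
        # The target's own token at the first mismatch (or the bonus token
        # after a fully accepted proposal) is always kept.
        out.append(preds[n])
    return out[:len(prompt) + max_new], target_calls


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target invocation per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out
```

Calling `speculative_decode(draft_next, target_next, [2], k=4, max_new=12)` returns the same tokens as `greedy_decode(target_next, [2], 12)` while invoking the target fewer times than the twelve calls the baseline needs. In a flash-backed setting each avoided target invocation also avoids a round of weight I/O, which is the motivation the abstract gives for pairing speculation with flash offloading.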
