Faster LLM Inference via Sequential Monte Carlo

Yahya Emara; Mauricio Barba da Costa; Chi-Chih Chang; Cameron Freer; Tim Vieira; Ryan Cotterell; Mohamed S. Abdelfattah

arXiv:2604.15672·cs.LG·April 20, 2026

TL;DR

This paper introduces SMC-SD, an inference method for large language models that replaces the token-level rejection sampling of speculative decoding with importance-weighted resampling over a population of draft particles. It trades exactness for speed while staying within 3% of the target model's accuracy.

Contribution

The paper proposes sequential Monte Carlo speculative decoding (SMC-SD), an approximate inference scheme that trades exactness for additional speed while preserving theoretical bounds on its per-step approximation error. It relies on importance-weighted resampling and on drafting and scoring particles in parallel.

Findings

01. SMC-SD achieves a 2.36x speed-up over speculative decoding.

02. SMC-SD achieves a 5.2x speed-up over autoregressive decoding.

03. SMC-SD stays within 3% of the target model's accuracy.

Abstract

Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to reweight them. To this end, we introduce sequential Monte Carlo speculative decoding (SMC-SD), which replaces token-level rejection with importance-weighted resampling over a population of draft particles. SMC-SD is a principled approximate inference scheme that trades exactness for additional speed, while preserving theoretical bounds on its per-step approximation error. Because LLM inference is memory bandwidth-bound, the arithmetic needed to draft particles and to score them in parallel comes nearly for free -- SMC-SD uses idle…
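
The reweighting idea described in the abstract can be illustrated with a generic sequential importance resampling step. What follows is a minimal sketch under stated assumptions, not the paper's algorithm: the toy distributions `DRAFT` and `TARGET`, the helper names, one-token-at-a-time extension, and the effective-sample-size trigger for resampling are all hypothetical choices made for illustration.

```python
import math
import random

# Hypothetical toy next-token distributions over a 3-token vocabulary.
# Real models condition on the prefix; these ignore it for brevity.
DRAFT = {"a": 0.5, "b": 0.3, "c": 0.2}   # cheap proposal q
TARGET = {"a": 0.2, "b": 0.3, "c": 0.5}  # expensive target p


def normalize(log_weights):
    """Convert log-weights to normalized weights (log-sum-exp for stability)."""
    m = max(log_weights)
    unnorm = [math.exp(lw - m) for lw in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def effective_sample_size(weights):
    """ESS of normalized weights: n when uniform, 1 when fully degenerate."""
    return 1.0 / sum(w * w for w in weights)


def smc_extend(particles, log_weights, rng, ess_fraction=0.5):
    """Extend every particle by one drafted token, reweight, maybe resample."""
    tokens = list(DRAFT)
    probs = [DRAFT[t] for t in tokens]
    new_particles, new_log_weights = [], []
    for seq, lw in zip(particles, log_weights):
        tok = rng.choices(tokens, weights=probs, k=1)[0]
        new_particles.append(seq + [tok])
        # Importance weight update: multiply by p(tok) / q(tok), in log space.
        new_log_weights.append(lw + math.log(TARGET[tok]) - math.log(DRAFT[tok]))

    weights = normalize(new_log_weights)
    n = len(new_particles)
    # Assumed heuristic: resample only when the weights become too uneven.
    if effective_sample_size(weights) < ess_fraction * n:
        idx = rng.choices(range(n), weights=weights, k=n)
        new_particles = [list(new_particles[i]) for i in idx]
        new_log_weights = [0.0] * n  # weights are uniform after resampling
    return new_particles, new_log_weights


rng = random.Random(0)
particles = [[] for _ in range(8)]
log_weights = [0.0] * 8
for _ in range(4):
    particles, log_weights = smc_extend(particles, log_weights, rng)
```

Unlike rejection-based verification, no drafted token is discarded here for disagreeing with the target. Each particle keeps its full continuation and simply carries a larger or smaller weight, and the population size stays fixed across steps.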
