SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Talor Abramovich; Maor Ashkenazi; Carl (Izzy) Putterman; Benjamin Chislett; Tiyasa Mitra; Bita Darvish Rouhani; Ran Zilberstein; Yonatan Geifman

arXiv:2604.09557·cs.DC·April 14, 2026

TL;DR

SPEED-Bench is a comprehensive benchmark suite designed to evaluate speculative decoding for large language models across diverse tasks and realistic serving scenarios, addressing limitations of existing benchmarks.

Contribution

The paper introduces a standardized, diverse, and production-relevant benchmark, with data splits and integration for assessing speculative decoding (SD) performance under realistic conditions.

Findings

01. Synthetic inputs overestimate real-world throughput.

02. Batch size affects optimal draft lengths and biases.

03. Vocabulary pruning impacts state-of-the-art drafters.

Abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across…
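The abstract's claim that SD performance is data-dependent can be illustrated with a small sketch. It is not taken from SPEED-Bench itself. It uses the standard simplified model from the speculative decoding literature, under the assumption that each drafted token is accepted independently with probability `alpha` and that `gamma` tokens are drafted per step. The function name and the two example acceptance rates are hypothetical and chosen only for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Simplifying assumption (not from the paper): every drafted token is
    accepted independently with probability `alpha`, and `gamma` tokens
    are drafted per step. Under that model the expectation is
    (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates standing in for two different workloads.
    workloads = {"low-acceptance workload": 0.5, "high-acceptance workload": 0.9}
    for name, alpha in workloads.items():
        tokens = expected_tokens_per_step(alpha, gamma=3)
        print(f"{name}: alpha={alpha}, expected tokens/step={tokens:.3f}")
```

With the same drafter configuration (`gamma=3`), the two hypothetical workloads yield roughly 1.9 versus 3.4 expected tokens per step under this model, which is why the choice of evaluation data can change the measured speedup.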

Code & Models

Repositories

nvidia/Model-Optimizer (GitHub)
