Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
Maximilian Holsman, Yukun Huang, Bhuwan Dhingra

TL;DR
Fuzzy Speculative Decoding (FSD) is a tunable decoding method that trades generation quality for inference speed by allowing controlled divergence from the target model, which lets it run faster than standard Speculative Decoding (SD).
Contribution
FSD generalizes Speculative Decoding by enabling a flexible trade-off between accuracy and speed through divergence control, and can be integrated into existing decoding frameworks.
Findings
FSD generates over 5 tokens per second faster than SD at roughly a 2% absolute reduction in accuracy.
FSD matches SD accuracy at higher speeds, showing that strict distributional equivalence is not necessary to preserve accuracy.
FSD can be integrated into existing SD extensions such as EAGLE-2 to improve their efficiency.
Abstract
Speculative Decoding (SD) enforces strict distributional equivalence to the target model when accepting candidate tokens. While this maintains the target model's generation quality, strict equivalence limits the speedup achievable by SD and prevents users from trading deviations from the target distribution for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD), a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergences between the target and draft model distributions. By allowing for controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method achieves significant runtime improvements of over 5 tokens per second relative to SD at only an approximate 2% absolute reduction…
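The acceptance idea described in the abstract can be illustrated with a minimal sketch. The abstract states only that candidates are accepted based on the divergence between the target and draft distributions; everything else here is an assumption made for illustration: the use of Jensen-Shannon divergence, the `threshold` parameter, the rule of stopping at the first position that exceeds it, and the hypothetical helper names `js_divergence` and `count_accepted`. What happens after a candidate is rejected is not modeled.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two probability vectors.

    Assumption: JS divergence is used here only as one plausible choice of
    divergence; the source does not say which divergence FSD uses.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return 0.5 * (kl_pm + kl_qm)


def count_accepted(target_probs, draft_probs, threshold):
    """Count how many leading candidate positions pass the divergence check.

    target_probs, draft_probs: sequences of per-position probability vectors
    over the same vocabulary. Hypothetical rule: accept candidates in order
    while the divergence stays at or below `threshold`, and stop at the
    first position that exceeds it.
    """
    accepted = 0
    for p, q in zip(target_probs, draft_probs):
        if js_divergence(p, q) > threshold:
            break
        accepted += 1
    return accepted
```

Under these assumptions, raising `threshold` accepts more draft tokens per verification step (more speed, more deviation from the target model), while lowering it accepts fewer, which is the tunable tradeoff the abstract describes.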
