Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios
Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao

TL;DR
This paper introduces SpecFormer, a novel model architecture that combines autoregressive and non-autoregressive mechanisms to enable efficient, scalable speculative decoding for large language models, especially in large-batch scenarios.
Contribution
SpecFormer integrates unidirectional and bidirectional attention to enable parallel sequence generation, reducing computational costs and improving scalability in LLM inference.
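One way to picture a mix of unidirectional and bidirectional attention is as a single attention mask with two regions. The sketch below is illustrative only and is not taken from the paper: the function name `hybrid_attention_mask`, the split into `num_context` ordinary tokens and `num_draft` draft slots, and the exact mask layout are all assumptions made for this example.

```python
def hybrid_attention_mask(num_context, num_draft):
    """Build a boolean mask; mask[i][j] is True if position i may attend to j.

    Assumed layout (hypothetical, not confirmed by the source):
    - the first `num_context` positions are ordinary tokens with causal
      (unidirectional) attention;
    - the last `num_draft` positions are draft slots that see the whole
      context and each other (bidirectional), so they can be filled in
      one parallel pass.
    """
    total = num_context + num_draft
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Causal region: a context token sees itself and earlier tokens.
                mask[i][j] = j <= i
            else:
                # Draft region: a draft slot sees every position.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    for row in hybrid_attention_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions the context rows form the usual lower triangle, while the draft rows are fully filled, which is what allows all draft positions to be predicted together instead of one at a time.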
Findings
SpecFormer achieves consistent acceleration in large-batch inference scenarios.
It reduces training demands and computational costs compared to existing methods.
Experimental results demonstrate improved inference speed across various model scales.
Abstract
Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate…
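The draft-then-verify loop the abstract refers to can be sketched in its simplest greedy form. This is a generic illustration of speculative decoding, not SpecFormer's procedure: the function name `verify_greedy` and the convention that the target model returns one prediction per draft position plus one extra are assumptions of this example.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Return the tokens kept after one greedy verification step.

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 greedy predictions from the target model, where
                   target_tokens[i] is its choice for draft position i and
                   the last entry follows the full draft (assumed convention).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    accepted = []
    for drafted, predicted in zip(draft_tokens, target_tokens):
        if drafted != predicted:
            # First disagreement: keep the target model's token and stop.
            accepted.append(predicted)
            return accepted
        accepted.append(drafted)

    # Every draft token matched, so the extra target prediction is also kept.
    accepted.append(target_tokens[-1])
    return accepted


if __name__ == "__main__":
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
```

Each step yields at least one token and at most k + 1, so the gain depends on how often the draft agrees with the target. In large-batch serving there is less idle compute left for verifying long or tree-shaped drafts, which is the constraint the abstract highlights.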
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Generative Adversarial Networks and Image Synthesis
