Optimized Speculative Sampling for GPU Hardware Accelerators
Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet

TL;DR
This paper speeds up speculative sampling on GPU accelerators in two ways: it computes intermediate matrices concurrently across thread blocks, which preserves accuracy, and it approximates softmax-parameterized distributions with sigmoid, which trades a minor decline in accuracy for a larger speedup.
Contribution
The paper presents two GPU optimization strategies for speculative sampling: concurrent computation of intermediate matrix segments within thread blocks, and a sigmoid-based approximation of the softmax-parameterized probability distributions.
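To illustrate why a sigmoid-based approximation can suit parallel hardware, the sketch below contrasts softmax, which needs two reductions over the whole vocabulary, with an element-wise sigmoid that needs none. It is a minimal CPU sketch in plain Python, not the authors' kernel; the `scale` and `shift` parameters are hypothetical placeholders, since this summary does not state how the paper scales the logits.

```python
import math


def softmax(logits):
    """Exact softmax: requires a max and a sum over all logits."""
    m = max(logits)                           # reduction 1: global maximum
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)                         # reduction 2: global sum
    return [e / total for e in exps]


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_scores(logits, scale=1.0, shift=0.0):
    """Element-wise sigmoid scores; each entry depends only on its own logit.

    `scale` and `shift` are assumptions made for this illustration.
    The result is not normalized, so it only approximates a distribution.
    """
    return [_sigmoid(scale * z + shift) for z in logits]
```

Because each sigmoid output depends on a single logit, segments of the vocabulary could in principle be processed independently, whereas softmax must first agree on a global maximum and sum. The price is that the scores no longer sum to one, which is consistent with the minor accuracy decline the paper reports.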
Findings
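The exact optimization improves profiling time by 6% to 13% relative to the baseline implementation, without compromising accuracy.
The sigmoid approximation improves profiling time by 37% to 94% relative to the baseline, with a minor decline in accuracy.
Both methods are evaluated on automatic speech recognition and summarization tasks.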
Abstract
In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization…
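For readers unfamiliar with the underlying procedure, the following is a minimal sketch of the standard speculative sampling accept/reject step that the abstract's "intermediate matrices" feed into. It is a plain-Python, single-token illustration and not the paper's GPU implementation; the function name `speculative_step` and its arguments are chosen here for illustration only.

```python
import random


def speculative_step(p, q, draft_token, rng):
    """One accept/reject step of standard speculative sampling.

    p: target-model probabilities over the vocabulary
    q: draft-model probabilities over the vocabulary
    draft_token: index sampled from q (so q[draft_token] > 0)
    rng: a random.Random instance
    Returns (token, accepted).
    """
    # Accept the draft token with probability min(1, p(x) / q(x)).
    ratio = p[draft_token] / q[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token, True

    # On rejection, resample from the normalized residual max(0, p - q).
    # A rejection implies p[x] < q[x], so some entry of the residual is positive.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(p)), weights=weights, k=1)[0]
    return token, False
```

Both the ratio `p / q` and the residual `max(0, p - q)` are element-wise over the vocabulary, which is the kind of work that can be split into segments and processed by separate thread blocks, as the abstract describes. Only the normalization of the residual needs a sum across all entries.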
Taxonomy
Topics: Embedded Systems Design Techniques · Sparse and Compressive Sensing Techniques · Advanced Data Compression Techniques
Methods: Softmax
