S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs
Wei Zhong, Manasa Bharadwaj

TL;DR
S3D introduces a cost-effective self-speculative decoding scheme for low-memory GPUs, significantly improving inference speed and memory efficiency for large language models with minimal architecture modifications.
Contribution
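The paper presents S3D, a self-speculative decoding method built on simultaneous multi-token decoding and mid-layer skipping. It targets low-memory GPUs and reaches one of the top performance-memory ratios among recent open-source SD systems while requiring minimal architecture changes and training data.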
Findings
S3D achieves one of the top performance-memory ratios among recent open-source SD systems.
The S3D-based model is 1.4 to 2 times faster than quantized EAGLE.
The method operates efficiently in half-precision with less VRAM.
Abstract
Speculative decoding (SD) has attracted a significant amount of research attention due to the substantial speedup it can achieve for LLM inference. However, despite the high speedups they offer, speculative decoding methods often achieve optimal performance on high-end devices or with a substantial GPU memory overhead. Given limited memory and the necessity of quantization, a high-performing model on a high-end GPU can slow down by up to 7 times. To this end, we propose Skippy Simultaneous Speculative Decoding (or S3D), a cost-effective self-speculative SD method based on simultaneous multi-token decoding and mid-layer skipping. When compared against recent effective open-source SD systems, our method has achieved one of the top performance-memory ratios while requiring minimal architecture changes and training data. Leveraging our memory efficiency, we created a smaller yet more…
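The draft-then-verify idea behind speculative decoding can be illustrated with a toy sketch. This is not the paper's implementation: `full_next` and `draft_next` are hypothetical stand-ins for a full forward pass and a cheaper pass (for example, one that skips middle layers), and the draft here proposes tokens one at a time rather than predicting several simultaneously as S3D does. The sketch only shows why greedy verification leaves the output identical to plain greedy decoding with the full model.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def full_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the full model's greedy next token."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical cheaper pass (e.g. with skipped layers).

    Agrees with full_next except when the context length is a multiple of 4.
    """
    tok = full_next(ctx)
    return (tok + 1) % 11 if len(ctx) % 4 == 0 else tok


def greedy_decode(prompt: List[Token], n_new: int, full: NextFn) -> List[Token]:
    """Baseline: one full-model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(full(out))
    return out


def speculative_decode(
    prompt: List[Token], n_new: int, k: int, draft: NextFn, full: NextFn
) -> Tuple[List[Token], int]:
    """Draft k tokens cheaply, then keep the prefix the full model agrees with."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    n_accepted = 0  # drafted tokens that passed verification
    while len(out) < target_len:
        # 1) Draft k candidate tokens with the cheap pass.
        drafted: List[Token] = []
        for _ in range(k):
            drafted.append(draft(out + drafted))
        # 2) Verify. A real system scores all k positions in one batched
        #    full-model pass; this toy calls the full model per position.
        kept: List[Token] = []
        for tok in drafted:
            expected = full(out + kept)
            kept.append(expected)  # always keep the full model's token
            if tok != expected:
                break  # first mismatch: discard the remaining drafts
            n_accepted += 1
        out.extend(kept)
    return out[:target_len], n_accepted


prompt = [1, 2, 3]
spec, n_acc = speculative_decode(prompt, 12, 4, draft_next, full_next)
print(spec == greedy_decode(prompt, 12, full_next), n_acc)
```

Every token kept is the full model's own greedy choice, so the draft quality affects only how many tokens are accepted per verification step, not the final text.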
Taxonomy
Topics: Neural Networks and Applications · DNA and Biological Computing · Error Correcting Code Techniques
