From Quarter to All: Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing

Yushu Zhao; Yubin Qin; Yang Wang; Xiaolong Yang; Huiming Han; Shaojun Wei; Yang Hu; Shouyi Yin

arXiv:2510.18525 · cs.AR · October 22, 2025

TL;DR

SPEQ is a speculative decoding approach that combines floating-point exponent remapping with parameter sharing between the draft and full models, accelerating large language model inference without extra training or storage overhead.

Contribution

It introduces an algorithm-hardware co-designed method that forms a quantized draft model from part of the full-model weight bits, so drafting needs no extra training or storage, while verification by the full model keeps speculative decoding lossless.
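The idea of reusing part of each stored weight as a draft weight can be illustrated with a minimal sketch. It assumes the simplest possible scheme, keeping only the top bits of each FP16 value (sign, exponent, and a few mantissa bits) and zeroing the rest. This plain truncation is an assumption made for illustration and is not SPEQ's exponent remapping, which the text here does not specify. The function name `draft_view` and the parameter `keep_bits` are hypothetical.

```python
import numpy as np


def draft_view(weights_fp16: np.ndarray, keep_bits: int = 8) -> np.ndarray:
    """Return a low-precision copy that keeps only the top `keep_bits` bits.

    Illustrative assumption: plain bit truncation of FP16 values, NOT the
    exponent remapping used by SPEQ. The draft values are derived entirely
    from the full-precision bits, so no second set of weights is trained.
    """
    if weights_fp16.dtype != np.float16:
        raise TypeError("expected float16 weights")
    if not 1 <= keep_bits <= 16:
        raise ValueError("keep_bits must be in [1, 16]")
    # Mask with ones in the top `keep_bits` positions of a 16-bit word.
    mask = np.uint16((0xFFFF << (16 - keep_bits)) & 0xFFFF)
    # Reinterpret the FP16 bit patterns as integers, mask, reinterpret back.
    bits = weights_fp16.view(np.uint16)
    return (bits & mask).view(np.float16)


if __name__ == "__main__":
    w = np.array([1.0, -0.3, 0.7512, 3.14159], dtype=np.float16)
    print("full :", w)
    print("draft:", draft_view(w, keep_bits=8))
```

With `keep_bits=8` the draft keeps the sign bit, all five exponent bits, and two mantissa bits, so each draft weight is the full weight rounded toward zero. In hardware the draft bits would be read directly from the stored weights rather than copied; the copy here only keeps the sketch short.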

Findings

01 Achieves up to 2.07x speedup over FP16

02 Demonstrates effectiveness across 15 LLMs and tasks

03 Reduces inference latency without additional training

Abstract

Large language models achieve impressive performance across diverse tasks but exhibit high inference latency due to their large parameter sizes. While quantization reduces model size, it often leads to performance degradation compared to the full model. Speculative decoding remains lossless but typically incurs extra overheads. We propose SPEQ, an algorithm-hardware co-designed speculative decoding method that uses part of the full-model weight bits to form a quantized draft model, thereby eliminating additional training or storage overhead. A reconfigurable processing element array enables efficient execution of both the draft and verification passes. Experimental results across 15 LLMs and tasks demonstrate that SPEQ achieves speedups of 2.07x, 1.53x, and 1.45x over FP16, Olive, and Tender, respectively.
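The draft and verification passes mentioned in the abstract follow the general pattern of speculative decoding, which can be sketched as a short greedy loop. This is a generic illustration under stated assumptions, not the SPEQ implementation: `draft_next` and `full_next` are hypothetical callables that return the next token for a given token sequence, and verification is written as sequential calls for clarity, whereas a real system would score all drafted positions in one batched pass of the full model.

```python
from typing import Callable, List

# Hypothetical interface: a model maps a token sequence to its next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    draft_next: NextToken,
    full_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 4,
) -> List[int]:
    """Greedy speculative decoding: a cheap model drafts, the full model verifies."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # Draft pass: the cheap model proposes `draft_len` tokens in a row.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # Verification pass: only tokens chosen by the full model are kept.
        for guess in proposal:
            if len(tokens) >= target_len:
                break
            expected = full_next(tokens)
            tokens.append(expected)
            if expected != guess:
                # First mismatch: discard the rest of the draft and redraft.
                break
    return tokens
```

Because every appended token comes from `full_next`, the output matches plain greedy decoding with the full model, which is the sense in which speculative decoding is lossless. Any speedup depends on how often the draft agrees with the full model and on how cheaply the drafted positions can be verified together.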

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Generative Adversarial Networks and Image Synthesis