PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models

Xuliang Wang; Yuetao Chen; Maochan Zhen; Fang Liu; Xinzhou Zheng; Xingwu Liu; Hong Xu; Ming Li

arXiv:2602.01762·cs.AI·February 3, 2026



TL;DR

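PRISM is a draft-model architecture for speculative sampling in large language models. It spreads the computation of each predictive step across different parameter sets, which decouples model capacity from inference cost and aims to speed up decoding and improve scalability without sacrificing draft quality.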

Contribution

PRISM refactors draft models by disaggregating the computation of each predictive step across different parameter sets. The authors report that it outperforms existing draft architectures in speed and scalability for speculative sampling in LLMs.
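As a rough illustration of what decoupling capacity from inference cost can mean, the toy calculation below compares a draft model that reuses one parameter set at every draft step with one that assigns a separate set to each step. The layout (a single square weight matrix per set), the names `DraftConfig`, `total_params`, and `per_step_flops`, and the sizes are all assumptions made for illustration; this page does not specify how PRISM organizes its parameter sets.

```python
from dataclasses import dataclass


@dataclass
class DraftConfig:
    """Hypothetical draft-model shape; not taken from the paper."""
    hidden: int            # hidden width of the draft layer
    num_steps: int         # draft tokens proposed per speculation round
    per_step_sets: bool    # True: one parameter set per step; False: one shared set


def set_params(hidden: int) -> int:
    # One square weight matrix plus a bias vector (assumed layout).
    return hidden * hidden + hidden


def total_params(cfg: DraftConfig) -> int:
    # Capacity grows with the number of parameter sets.
    num_sets = cfg.num_steps if cfg.per_step_sets else 1
    return num_sets * set_params(cfg.hidden)


def per_step_flops(cfg: DraftConfig) -> int:
    # Each step touches exactly one set, so a matrix-vector product
    # costs about 2 * hidden^2 FLOPs no matter how many sets exist.
    return 2 * cfg.hidden * cfg.hidden


shared = DraftConfig(hidden=1024, num_steps=4, per_step_sets=False)
split = DraftConfig(hidden=1024, num_steps=4, per_step_sets=True)
print(total_params(shared), total_params(split))
print(per_step_flops(shared), per_step_flops(split))
```

Under these assumptions the split configuration holds four times the parameters of the shared one while each draft step costs the same, which is the kind of trade the summary describes.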

Findings

01

PRISM achieves over 2.6x speedup in decoding throughput.

02

PRISM scales more effectively with larger data volumes.

03

PRISM maintains high draft quality with minimal latency.

Abstract

Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM…
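For readers new to speculative decoding, the sketch below shows the standard accept/reject rule used to verify one draft token against the target model: accept with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q). This is the generic speculative sampling rule, not anything specific to PRISM, and the function name `verify_draft_token` and the toy distributions are assumptions for illustration.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Verify one draft token (generic speculative sampling, not PRISM-specific).

    p_target and q_draft are probability lists over the same vocabulary.
    Returns (token_to_emit, accepted).
    """
    p = p_target[token]
    q = q_draft[token]
    # Accept the draft token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return token, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case: fall back to the target distribution itself.
        residual = list(p_target)
    corrected = rng.choices(range(len(residual)), weights=residual)[0]
    return corrected, False


rng = random.Random(0)
p_target = [1.0, 0.0]   # toy target distribution
q_draft = [0.5, 0.5]    # toy draft distribution
print(verify_draft_token(0, p_target, q_draft, rng))
print(verify_draft_token(1, p_target, q_draft, rng))
```

A draft model whose distribution sits closer to the target has more tokens accepted per round, which is why draft quality and draft cost pull against each other in the way the abstract describes.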

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Generative Adversarial Networks and Image Synthesis