POSS: Position Specialist Generates Better Draft for Speculative Decoding
Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang

TL;DR
This paper introduces Position Specialists (PosS), a novel approach that enhances speculative decoding for large language models by using position-specific draft layers, leading to improved token prediction accuracy and inference speed.
Contribution
The paper proposes Position Specialists, a new method with position-specific draft layers that mitigate error accumulation and improve decoding quality at later positions.
Findings
PosS increases average acceptance length.
PosS achieves higher speed-up ratios.
PosS outperforms baseline methods on multiple datasets.
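As a rough illustration of why acceptance at later positions matters for average acceptance length, the sketch below computes the expected number of accepted draft tokens in one round of a linear (non-tree) draft. It assumes each entry of `pos_acc` is the probability that the draft token at that position is accepted given that all earlier draft tokens were accepted. The function name and the example rates are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def expected_accepted_drafts(pos_acc: Sequence[float]) -> float:
    """Expected number of accepted draft tokens in one drafting round.

    Assumption (not from the paper): pos_acc[k] is the conditional
    probability that draft position k is accepted, given that every
    earlier position in the same round was accepted.
    """
    expected = 0.0
    survive = 1.0  # probability that the draft is still alive at this position
    for rate in pos_acc:
        survive *= rate      # position k counts only if positions 0..k all pass
        expected += survive  # add P(at least k+1 draft tokens are accepted)
    return expected


# Illustrative, made-up rates: same first-position accuracy, different decay.
flat = expected_accepted_drafts([0.8, 0.8, 0.8])
decaying = expected_accepted_drafts([0.8, 0.6, 0.4])
print(f"flat={flat:.3f} decaying={decaying:.3f}")
```

Because each position contributes a product of all earlier rates, a drop in acceptance at later positions lowers the expected length even when the first position is unchanged.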
Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six…
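The idea of assigning draft positions to specialists can be sketched as a simple dispatch loop. This is a minimal toy under stated assumptions, not the authors' implementation: features are plain lists of floats, each "specialist" is an arbitrary callable, every specialist covers a fixed number of consecutive positions, and the last specialist handles any positions beyond its range. All names (`specialist_for_position`, `draft_features`, `positions_per_specialist`) are hypothetical.

```python
from typing import Callable, List, Sequence

Feature = List[float]
DraftLayer = Callable[[Feature], Feature]


def specialist_for_position(position: int, num_specialists: int,
                            positions_per_specialist: int) -> int:
    """Map a 0-based draft position to a specialist index.

    Assumption: positions past the last specialist's range fall back to
    the last specialist.
    """
    return min(position // positions_per_specialist, num_specialists - 1)


def draft_features(start: Feature, specialists: Sequence[DraftLayer],
                   depth: int, positions_per_specialist: int = 1) -> List[Feature]:
    """Roll out `depth` draft features, routing each position to its specialist."""
    features: List[Feature] = []
    feature = start
    for position in range(depth):
        index = specialist_for_position(
            position, len(specialists), positions_per_specialist)
        # Each specialist consumes the previous (possibly deviated) draft
        # feature and produces the feature for its own position.
        feature = specialists[index](feature)
        features.append(feature)
    return features


# Toy stand-ins for trained draft layers.
def add_one(f: Feature) -> Feature:
    return [x + 1.0 for x in f]


def double(f: Feature) -> Feature:
    return [x * 2.0 for x in f]


rollout = draft_features([1.0], [add_one, double], depth=3)
print(rollout)
```

In a real system each specialist would be a trained draft layer and the features would be hidden states. The toy only shows the routing, where a specialist sees inputs from its assigned position(s) rather than from every draft depth.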
Peer Reviews
Decision: Submitted to ICLR 2026
1. The core motivation is clear and compelling. The paper accurately identifies a critical and practical problem in existing methods: the "accumulated feature deviation" and the limited generalization capability of a single draft model. The paper is well-written, with a logical flow and clear exposition. 2. The design of PosS is elegant and intuitive, employing different "specialist" models to handle tasks of varying difficulty (i.e., different levels of feature deviation). The effectiveness of
1. The paper mentions "tree-draft" in Section 6.2 but does not sufficiently clarify its relationship with the experiments. It is unclear whether "draft depth" in Sections 6.2 and 6.3 refers to the depth of the tree or simply the length of a linear draft sequence. The core idea of PosS is to improve the quality of a single draft path. The paper should discuss PosS's orthogonality with parallel verification methods like tree-drafting. 2. Compared to a single draft model, PosS requires training and
- The introduction of the position-wise acceptance rate (pos-acc) metric provides a crucial analytical tool for diagnosing and comparing the efficiency of different speculative decoding methods at a granular level. - The Position Specialists (PosS) architecture is a novel and intuitive approach that effectively mitigates the fundamental challenge of accumulated feature deviation by distributing the prediction task across multiple specialized draft layers.
1. The performance gain achieved by the proposed method is difficult to attribute solely to the "Position Specialists" architecture, as opposed to the "HASS-like" recursive feature alignment training strategy (which PosS utilizes). Specifically, the incremental improvement of PosS-3(E2) over the HASS baseline is marginal (an average of only 2.1% on L3-8B and 0.3% on L2-13B, based on the speedup ratio from Table 2 at t=0). This small margin suggests that the performance lift might primarily stem
1. This paper is technically sound and easy to understand. 2. The experimental results show the effectiveness of the proposed method.
1. What is the difference between PosS and Medusa, which also uses different heads for different token positions? It seems that PosS-1 is exactly the same as Medusa. 2. Eq. 4 seems to degenerate to P(A_k); why not state it that way? 3. What is the performance when the architecture of PosS-1 is changed to Medusa and everything else is left unchanged?
Code & Models
- 🤗 HINT-lab/PosS1-Llama3-8B-Instruct (model)
- 🤗 HINT-lab/PosS2-Llama3-8B-Instruct (model, 1 dl)
- 🤗 HINT-lab/PosS3-Llama3-8B-Instruct (model, 3 dl)
- 🤗 HINT-lab/PosS1-Llama2-13B-Chat (model)
- 🤗 HINT-lab/PosS2-Llama2-13B-Chat (model, 1 dl)
- 🤗 HINT-lab/PosS3-Llama2-13B-Chat (model, 1 dl)
- 🤗 HINT-lab/EAGLE-Llama3-8B-Instruct-Reproduce (model)
- 🤗 HINT-lab/HASS-Llama3-8B-Instruct-Reproduce (model)
- 🤗 HINT-lab/PosS3-E3-Llama3.1-8B-Instruct (model, 1 dl)
- 🤗 HINT-lab/PosS134-Llama3.1-8B-Instruct (model)
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
Methods: Focus
