TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
Haoyun Jiang, Junqi He, Feng Hong, Xinlong Yang, Jianwei Zhang, Zheng Li, Zhengyang Zhuge, Zhiyong Chen, Bo Han, Junyang Lin, Jiangchao Yao

TL;DR
TriSpec introduces a ternary speculative decoding framework that uses a lightweight proxy to significantly reduce verification costs, enabling faster inference in large language models without sacrificing accuracy.
Contribution
The paper proposes TriSpec, a novel ternary speculative decoding method that reduces verification costs using a lightweight proxy, enhancing inference speed in large language models.
Findings
Achieves up to 35% speedup over standard SD.
Reduces target model invocations by up to 50%.
Maintains comparable accuracy with improved efficiency.
Abstract
Inference efficiency in Large Language Models (LLMs) is fundamentally limited by their serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through its lightweight drafting and parallel verification mechanism. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new yet critical perspective: the verification cost. We propose TriSpec, a novel ternary SD framework that, at its core, introduces a lightweight proxy to significantly reduce computational cost by approving easily verifiable draft sequences and engaging the full target model only when encountering uncertain tokens. TriSpec can be integrated with state-of-the-art SD methods like EAGLE-3 to further reduce…
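The routing described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the function names, the greedy token-matching rule, and the threshold value of 0.5 are assumptions made for illustration. Only the top-1 vs. top-2 margin gate comes from the source, via a reviewer's description of the method in the reviews below.

```python
from typing import List, Sequence, Tuple


def top2_margin(probs: Sequence[float]) -> float:
    """Gap between the largest and second-largest probability.

    Assumes the distribution has at least two entries.
    """
    ordered = sorted(probs, reverse=True)
    return ordered[0] - ordered[1]


def route_draft(
    draft_tokens: Sequence[int],
    proxy_probs: Sequence[Sequence[float]],
    margin_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> Tuple[List[int], bool]:
    """Walk the draft left to right using only the proxy's distributions.

    Returns (accepted_tokens, needs_target). needs_target is True when the
    proxy was uncertain at some position, so the remaining draft should be
    handed to the full target model.
    """
    accepted: List[int] = []
    for token, probs in zip(draft_tokens, proxy_probs):
        if top2_margin(probs) < margin_threshold:
            # Proxy is uncertain here: stop and escalate to the target.
            return accepted, True
        top1 = max(range(len(probs)), key=probs.__getitem__)
        if token != top1:
            # Proxy is confident and disagrees with the draft: emit the
            # proxy's token and end the round without calling the target.
            accepted.append(top1)
            return accepted, False
        accepted.append(token)
    return accepted, False


if __name__ == "__main__":
    draft = [2, 0, 1]
    proxy = [[0.05, 0.05, 0.9], [0.8, 0.1, 0.1], [0.4, 0.45, 0.15]]
    print(route_draft(draft, proxy))
```

In this toy run the proxy approves the first two draft tokens on its own and flags the third position, where its top two candidates are nearly tied, for the target model.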
Peer Reviews
Decision: Submitted to ICLR 2026
The writing is very clear and easy to follow. I particularly appreciate that the authors clearly illustrate the bottlenecks that current speculative decoding systems suffer from, as shown in Figure 2. The proposed approach—based on introducing a lightweight proxy verifier to reduce verification cost—is both reasonable and well motivated. In terms of experiments, the authors conduct comprehensive evaluations on five benchmarks across two metrics (accuracy and speedup), demonstrating consistent im
The hierarchical framework does not seem entirely new; previous work such as TriForce [1] also employs a similar hierarchical framework. I understand there are some differences, but the authors should discuss them. In addition, I find the preliminary observation in Figure 2(b) particularly interesting. However, I wonder whether this phenomenon persists under varied temperature settings. Intuitively, when the temperature is higher, the output distribution becomes smoother, which mi
- TriSpec is a simple and effective idea, using small models as verifiers for a fast single-layer drafter, similar to model cascades but for verification.
- Across all domains presented in the paper, TriSpec demonstrates higher speedups compared to baselines while showing negligible performance loss compared to the target model. These results show that with the right proxy, the loss of the losslessness guarantee from classical speculative decoding will not adversely affect output quality.
- The paper only examines two model families, both based on Qwen: Qwen3 and DeepSeek-R1-Distill-Qwen. Experiments on model families from other providers would strengthen the paper. In the paper’s current state, it is unclear whether the effectiveness of smaller model variants as proxy verifiers is particular to Qwen as a model provider.
- The paper only examines two settings: math and code reasoning. These settings may be much more structured than more general domains, better suiting proxy mod
1. The paper identifies verification time as a first-order bottleneck in modern SD stacks and operationalizes a clear, reproducible fix: insert a same-family proxy and gate escalation with a top-1 vs. top-2 margin. The algorithm is simple to implement atop EAGLE-family drafters.
2. The presentation and the figures are intuitive and easy to understand.
3. The experiments show large reductions in target-invocation ratio and lower per-round verification time, while keeping acceptance length stable.
1. Novelty is limited versus recent verification-side work. While the motivation to reduce target calls with a cheaper verifier is straightforward, the idea of introducing a mid-level LLM between the draft model and the target model is well explored.
2. TriSpec achieves a better speedup ratio at the cost of losing the theoretical lossless property of speculative decoding, which is especially important in real-world applications. The method can accept proxy-approved tokens that differ from the target
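For context on the lossless property the reviewers mention, the following is a minimal sketch of the standard speculative-sampling acceptance rule from classical speculative decoding, not of TriSpec itself. The function name and the toy distributions are assumptions for illustration. A draft token x sampled from the draft distribution q is accepted with probability min(1, p(x)/q(x)); on rejection, a token is resampled from the normalized residual max(0, p - q). This makes the emitted token follow the target distribution p exactly, which is the guarantee that a proxy approving tokens in place of the target no longer provides.

```python
import random
from typing import Sequence


def verify_token(
    token: int,
    p_target: Sequence[float],
    q_draft: Sequence[float],
    rng: random.Random,
) -> int:
    """Return the token to emit for one drafted position.

    `token` is assumed to have been sampled from q_draft, so
    q_draft[token] > 0.
    """
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token  # draft token accepted unchanged
    # Rejected: resample from the residual max(0, p - q). random.choices
    # accepts unnormalized weights, so no explicit normalization is needed.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # toy target distribution (hypothetical)
    q = [0.2, 0.5, 0.3]  # toy draft distribution (hypothetical)
    counts = [0, 0, 0]
    n = 20000
    for _ in range(n):
        drafted = rng.choices(range(3), weights=q, k=1)[0]
        counts[verify_token(drafted, p, q, rng)] += 1
    # Empirical frequencies should be close to p, not q.
    print([round(c / n, 3) for c in counts])
```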
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
