Consultant Decoding: Yet Another Synergistic Mechanism

Chuanghao Ding; Jiaping Wang; Ziqing Yang; Xiaoliang Wang; Dahua Lin; Cam-Tu Nguyen; Fei Tan

arXiv:2506.02391·cs.CL·June 4, 2025

TL;DR

This paper introduces Consultant Decoding (CD), a verification mechanism for draft-then-verify LLM inference that checks draft tokens against token-level likelihoods computed by the large model alone. It reports fewer large-model calls and faster inference than existing speculative decoding methods.

Contribution

The paper proposes Consultant Decoding, a new verification mechanism that improves inference speed and reduces model calls, outperforming existing speculative decoding methods.

Findings

01 Up to 2.5x inference speed increase

02 Reduces large model calls to below 10%

03 Surpasses the large target model's performance

Abstract

The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, high rejection rates require repeated LLM calls to validate draft tokens, undermining the overall efficiency gain of SD. In this work, we revisit existing verification mechanisms and propose a novel synergistic mechanism, Consultant Decoding (CD). Unlike SD, which relies on a metric derived from importance sampling for verification, CD verifies candidate drafts using token-level likelihoods computed solely by the LLM. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality (around 100% of the target model's performance). Interestingly, this is achieved by combining models whose parameter sizes differ…
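The contrast between the two verification styles can be illustrated with a minimal sketch. The first rule is the standard speculative-decoding acceptance test, which accepts a draft token with probability min(1, p/q), where p and q are the target and draft probabilities. The second rule looks only at the target model's probability of the draft token. The abstract does not state CD's exact acceptance criterion, so the fixed `threshold` below, the function names, and the probabilities are all assumptions made for illustration and are not the authors' method.

```python
import random


def sd_accept(p_target: float, q_draft: float, rng: random.Random) -> bool:
    """Standard speculative-decoding test: accept with probability min(1, p/q)."""
    return rng.random() < min(1.0, p_target / q_draft)


def likelihood_accept(p_target: float, threshold: float = 0.3) -> bool:
    """Hypothetical likelihood-only test (an assumption, not the paper's rule):
    accept when the target model alone assigns the draft token enough probability."""
    return p_target >= threshold


def accepted_prefix_len(decisions) -> int:
    """Count draft tokens accepted before the first rejection."""
    n = 0
    for ok in decisions:
        if not ok:
            break
        n += 1
    return n


# Made-up per-token probabilities for a 4-token draft (illustrative only).
p_target = [0.9, 0.5, 0.2, 0.8]   # probability of each draft token under the target model
q_draft = [0.95, 0.9, 0.1, 0.8]   # probability of the same token under the draft model

rng = random.Random(0)
sd_len = accepted_prefix_len(sd_accept(p, q, rng) for p, q in zip(p_target, q_draft))
cd_len = accepted_prefix_len(likelihood_accept(p) for p in p_target)
print("SD-style accepted prefix:", sd_len)
print("likelihood-only accepted prefix:", cd_len)
```

With these numbers the likelihood-only rule accepts the first two tokens and stops at the third, whose target probability of 0.2 falls below the assumed threshold. The SD-style result depends on the random draws: the second token (p = 0.5, q = 0.9) is accepted only about 56% of the time, even though the target model finds it reasonably likely. This is the kind of rejection the abstract says forces repeated LLM calls.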
