S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models

Tao He; Guang Huang; Yu Yang; Tianshi Xu; Sicheng Zhao; Guiguang Ding; Pengyang Wang; Feng Tian

arXiv:2506.14158 · cs.CL · June 18, 2025

TL;DR

S$^4$C is a speculative sampling framework that exploits syntactic and semantic coherence in generated text to accelerate large language model inference while maintaining output quality.

Contribution

S$^4$C extends speculative sampling with multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse.

Findings

01 Achieves 2.26x-2.60x acceleration on Spec-bench

02 Outperforms state-of-the-art methods in efficiency

03 Generates more valid tokens with fewer resources

Abstract

Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, which limits their efficiency. To address this gap, we propose the Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks,…
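The draft-then-validate loop that the abstract attributes to speculative sampling in general can be illustrated with a minimal sketch. This is not the S$^4$C method: it omits multi-head drafting and the verification tree, uses greedy token matching rather than probabilistic acceptance, and emulates the parallel validation pass with a sequential loop. All names (`draft_tokens`, `verify_tokens`, `toy_target`, `toy_draft`) are hypothetical and exist only for this illustration.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next token


def draft_tokens(draft_next: NextToken, prefix: List[Token], k: int) -> List[Token]:
    """Drafting phase: a cheap model proposes k tokens autoregressively."""
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))
    return proposed


def verify_tokens(target_next: NextToken, prefix: List[Token],
                  proposed: List[Token]) -> List[Token]:
    """Validation phase: keep the longest run of draft tokens the target agrees
    with, then append one token chosen by the target itself.
    A real system would score every position in one parallel forward pass;
    this loop only emulates that result."""
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token after full acceptance
    return accepted


def speculative_decode(target_next: NextToken, draft_next: NextToken,
                       prompt: List[Token], max_new_tokens: int,
                       k: int = 4) -> Tuple[List[Token], int]:
    """Returns the generated sequence and the number of validation rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        proposed = draft_tokens(draft_next, out, k)
        out.extend(verify_tokens(target_next, out, proposed))
        rounds += 1
    return out[:len(prompt) + max_new_tokens], rounds


def greedy_decode(target_next: NextToken, prompt: List[Token],
                  max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


# Toy stand-ins for the two models (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target, toy_draft, [2], 12)
    print(tokens, rounds)
```

In this sketch the output matches plain greedy decoding with the target model, while the number of validation rounds is smaller than the number of generated tokens whenever the draft model is often right. That gap is the source of the speedup speculative sampling aims for.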

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Speech Recognition and Synthesis