TL;DR
R-Stitch is a hybrid decoding framework that dynamically routes reasoning tokens between small and large language models based on entropy, significantly accelerating inference while maintaining accuracy.
Contribution
It introduces a training-free, entropy-guided routing strategy and an adaptive policy extension for efficient, high-quality reasoning in large language models.
Findings
Achieves up to 4.10× speedup with negligible accuracy loss.
Reduces inference cost by keeping confident tokens on the SLM and delegating only uncertain, high-entropy tokens to the LLM.
Enables flexible efficiency-accuracy trade-offs without retraining.
Abstract
Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided…
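The entropy-guided routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`token_entropy`, `hybrid_decode`), the threshold name `tau`, greedy token selection, and the exact hand-off rule (the LLM regenerates the token when the SLM is uncertain, and control returns to the SLM once the LLM is confident) are all assumptions made for illustration. Models are stood in for by plain callables that map a token list to a next-token probability list.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a "model" maps the tokens so far to a
# next-token probability distribution over the vocabulary.
NextDist = Callable[[List[int]], List[float]]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def hybrid_decode(
    slm: NextDist,
    llm: NextDist,
    prompt: Sequence[int],
    tau: float,
    max_new_tokens: int,
) -> Tuple[List[int], List[Tuple[int, str]]]:
    """Greedy decoding that switches models on an entropy threshold.

    Assumed rule (illustrative only): start on the SLM; if its entropy
    exceeds tau, the LLM produces that token instead; once the LLM's
    entropy is at or below tau, control returns to the SLM.
    """
    tokens = list(prompt)
    trace: List[Tuple[int, str]] = []  # (token, model that produced it)
    active = "slm"
    for _ in range(max_new_tokens):
        probs = (slm if active == "slm" else llm)(tokens)
        h = token_entropy(probs)
        if active == "slm" and h > tau:
            # SLM is uncertain: let the LLM generate this token.
            active = "llm"
            probs = llm(tokens)
            h = token_entropy(probs)
        next_tok = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_tok)
        trace.append((next_tok, active))
        if active == "llm" and h <= tau:
            # LLM is confident again: hand control back to the SLM.
            active = "slm"
    return tokens, trace
```

In this sketch, raising `tau` keeps more tokens on the SLM (faster, riskier) and lowering it sends more to the LLM, which is one way to read the efficiency-accuracy trade-off mentioned above. A real system would also have to manage each model's KV cache across switches, which this sketch ignores.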
Peer Reviews
Decision·Submitted to ICLR 2026
* The core algorithm is simple and training-free in its base form, using a clear entropy threshold to switch between SLM and LLM in both directions so that the system exploits concise SLM spans without sacrificing reliability on high-uncertainty tokens.
* The method is well motivated by an empirical analysis showing that incorrect answers have higher token entropy and that most tokens are very low entropy, which justifies entropy as a routing signal.
* The systems design is thoughtful, with expl
* The router is only described as a “lightweight” module fed by hidden states; its architecture, parameter count, placement, and per-token overhead are not reported, so deployability and reproduction costs are unclear.
* All latency results use a single GPU with batch size one. The current implementation only supports batch size one because switching happens at the token level. Real-world throughput under concurrent traffic is unknown.
* The system runs two engines with separate KV caches. This
Strengths are in the above review.
Weaknesses are in the above review.
1. The observation that high entropy leads to incorrect traces is interesting and well investigated, and the proposed method is well aligned with it.
2. The experiments are thorough, covering multiple LLMs and benchmarks, and show the benefits of R-Stitch.
3. The ablation study is well designed and justifies the design choices.
1. The choice of SLM is not reasonable. L1-1.5B-Short is used as the SLM, while the target models are from the DeepSeek-R1 family. SpecDec is efficient only when the draft and target models' distributions are aligned. In Table 1, SpecDec's speedup is even worse than running the target model alone, which is unreasonable. It is suggested to include new results with an SLM from the same family.
2. Lack of baselines. Only two baselines are included here, LLM and SpecDec. It's suggested to include recent strong baselin
