R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Zhuokun Chen; Zeren Chen; Jiahao He; Lu Sheng; Mingkui Tan; Jianfei Cai; and Bohan Zhuang

arXiv:2507.17307·cs.LG·February 10, 2026

3 Reviews

TL;DR

R-Stitch is a hybrid decoding framework that dynamically routes reasoning tokens between small and large language models based on entropy, significantly accelerating inference while maintaining accuracy.

Contribution

It introduces a training-free, entropy-guided routing strategy and an adaptive policy extension for efficient, high-quality reasoning in large language models.

Findings

1. Achieves up to 4.10× speedup with negligible accuracy loss.

2. Reduces inference cost by delegating only uncertain (high-entropy) tokens to the LLM.

3. Enables flexible efficiency-accuracy trade-offs without retraining.

Abstract

Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided…
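As an illustration of the uncertainty proxy the abstract describes, the following minimal sketch computes the Shannon entropy of a single next-token distribution and compares it to a threshold. The function names `token_entropy` and `needs_llm`, the toy distributions, and the threshold value 0.5 are assumptions made for this example; they are not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def needs_llm(probs, threshold):
    """True when the distribution is uncertain enough to delegate the token.

    `threshold` is a hypothetical tuning knob, not a value from the paper.
    """
    return token_entropy(probs) > threshold

# Toy distributions over a four-token vocabulary (illustrative only).
confident = [0.97, 0.01, 0.01, 0.01]   # mass concentrated on one token
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform, so entropy equals ln(4)

if __name__ == "__main__":
    for name, dist in [("confident", confident), ("uncertain", uncertain)]:
        print(name, round(token_entropy(dist), 3), needs_llm(dist, threshold=0.5))
```

Raising the threshold keeps more tokens on the small model, while lowering it sends more tokens to the large one. This is one way to read the efficiency-accuracy trade-off mentioned in the findings.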

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

* The core algorithm is simple and training free in its base form, using a clear entropy threshold to switch between SLM and LLM in both directions so that the system exploits concise SLM spans without sacrificing reliability on high-uncertainty tokens.
* The method is well motivated by an empirical analysis showing that incorrect answers have higher token entropy and that most tokens are very low entropy, which justifies entropy as a routing signal.
* The systems design is thoughtful, with expl
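The bidirectional switching that the reviewer describes can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation. The `slm_step` and `llm_step` callables, the rule that the LLM regenerates the position where the SLM was uncertain, and the use of a single shared threshold are all hypothetical choices made for the example.

```python
import math
from typing import Callable, List, Tuple

# A step function maps the tokens so far to (next_token, next_token_probs).
Step = Callable[[List[str]], Tuple[str, List[float]]]

def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def stitch_decode(slm_step: Step, llm_step: Step, prompt: List[str],
                  threshold: float, max_tokens: int = 32, eos: str = "<eos>"):
    """Decode with the SLM by default and hand over to the LLM when uncertain."""
    tokens = list(prompt)
    owners: List[str] = []          # which model produced each generated token
    active = "slm"
    while len(owners) < max_tokens:
        if active == "slm":
            tok, probs = slm_step(tokens)
            if entropy(probs) > threshold:
                active = "llm"      # assumption: drop the uncertain SLM token
                continue            # and let the LLM regenerate this position
            owner = "slm"
        else:
            tok, probs = llm_step(tokens)
            owner = "llm"
            if entropy(probs) <= threshold:
                active = "slm"      # LLM is confident again: hand back
        tokens.append(tok)
        owners.append(owner)
        if tok == eos:
            break
    return tokens, owners

# Hypothetical stand-in models: the SLM is uncertain at exactly one position.
def toy_slm(tokens: List[str]) -> Tuple[str, List[float]]:
    if len(tokens) == 2:
        return "?", [0.5, 0.5]
    if len(tokens) >= 4:
        return "<eos>", [0.99, 0.01]
    return "s", [0.99, 0.01]

def toy_llm(tokens: List[str]) -> Tuple[str, List[float]]:
    return "L", [0.99, 0.01]

if __name__ == "__main__":
    print(stitch_decode(toy_slm, toy_llm, ["q"], threshold=0.3))
```

In this toy run the LLM contributes only the one token where the SLM's distribution was uniform, and every other token comes from the SLM. A real system would also need to keep both models' KV caches in sync across each switch, which the sketch omits.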

Weaknesses

* The router is only described as a “lightweight” module fed by hidden states; its architecture, parameter count, placement, and per-token overhead are not reported, so deployability and reproduction costs are unclear.
* All latency results use a single GPU with batch size one. The current implementation only supports batch size one because switching happens at the token level. Real-world throughput under concurrent traffic is unknown.
* The system runs two engines with separate KV caches. This

Reviewer 02·Rating 6·Confidence 2

Strengths

Strengths are in the above review.

Weaknesses

Weaknesses are in the above review.

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The observation that high entropy leads to incorrect traces is interesting and well investigated, and the proposed method is well aligned with this observation.
2. The experiments are thorough, with multiple LLMs and benchmarks, showing the benefits of R-Stitch.
3. The ablation study is well designed and justifies the design choices.

Weaknesses

1. The choice of SLM is not reasonable. L1-1.5B-Short is used as the SLM, while the target model is from the DeepSeek-R1 family. SpecDec is known to be efficient when the draft and target models' distributions are aligned. In Table 1, SpecDec is even slower than the target model alone, which is unreasonable. It is suggested to include new results with an SLM from the same family.
2. Lack of baselines. Only two baselines are included here, LLM and SpecDec. It's suggested to include recent strong baselin
