RelayLLM: Efficient Reasoning via Collaborative Decoding

Chengsong Huang; Tong Zheng; Langlin Huang; Jinyuan Li; Haolin Liu; Jiaxin Huang

arXiv:2601.05167·cs.CL·January 9, 2026

TL;DR

RelayLLM introduces a token-level collaborative decoding framework that enables small language models to handle most reasoning tasks independently, invoking large models only for critical tokens, thereby significantly reducing computational costs while maintaining high accuracy.

Contribution

The paper presents a novel token-level collaborative decoding approach and a training framework that allows small language models to efficiently delegate critical reasoning tokens to large models, reducing costs.

Findings

01 Achieves an average accuracy of 49.52% across six benchmarks.

02 Invokes the LLM for only 1.07% of generated tokens on average.

03 Reduces computational cost by 98.2% compared to random routing.

Abstract

Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to…
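The control flow described in the abstract (the SLM generates by default and hands over to the LLM only when it emits a special command) can be illustrated with a minimal sketch. This is not the authors' implementation: the command string `<call>`, the `relay_len` parameter (how many tokens the LLM contributes per invocation), and the toy `toy_slm`/`toy_llm` functions are all assumptions made up for illustration, and real models would replace the two callables.

```python
from typing import Callable, List, Tuple

# Hypothetical names: the source only says "a special command".
CALL_TOKEN = "<call>"
EOS_TOKEN = "<eos>"

# A "model" here is any function mapping the current context to the next token.
NextToken = Callable[[List[str]], str]


def relay_decode(
    slm_next: NextToken,
    llm_next: NextToken,
    prompt: List[str],
    relay_len: int = 1,      # assumed: tokens the LLM writes per invocation (>= 1)
    max_tokens: int = 64,
) -> Tuple[List[str], int]:
    """Return (generated tokens, number of tokens produced by the LLM)."""
    context = list(prompt)
    generated: List[str] = []
    llm_tokens = 0
    while len(generated) < max_tokens:
        token = slm_next(context)
        if token == EOS_TOKEN:
            break
        if token == CALL_TOKEN:
            # The command itself is not kept in the output; the LLM
            # continues from the same context for up to relay_len tokens.
            for _ in range(relay_len):
                if len(generated) >= max_tokens:
                    break
                relayed = llm_next(context)
                if relayed == EOS_TOKEN:
                    return generated, llm_tokens
                context.append(relayed)
                generated.append(relayed)
                llm_tokens += 1
            continue
        context.append(token)
        generated.append(token)
    return generated, llm_tokens


def toy_slm(context: List[str]) -> str:
    # Toy small model: follows a fixed script, but delegates the token after "=".
    if context[-1] == "=":
        return CALL_TOKEN
    script = {"2": "+", "+": "3", "3": "=", "5": EOS_TOKEN}
    return script.get(context[-1], EOS_TOKEN)


def toy_llm(context: List[str]) -> str:
    # Toy large model: only knows how to complete "2 + 3 =".
    return "5" if context[-1] == "=" else EOS_TOKEN


tokens, llm_calls = relay_decode(toy_slm, toy_llm, ["2"])
print(tokens, llm_calls, llm_calls / len(tokens))
```

In this toy run the LLM writes one of the four generated tokens. The quantity `llm_calls / len(tokens)` plays the role of the token-level invocation ratio reported in the findings, although the sketch says nothing about how the SLM learns when to issue the command.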

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications