Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router

Chenyang Shao; Xinyang Liu; Yutang Lin; Fengli Xu; Yong Li

arXiv:2506.05901·cs.CL·December 5, 2025

Open Access · 4 Reviews

TL;DR

R2-Reasoner introduces a reinforced model router that enables heterogeneous LLMs to collaborate on reasoning tasks at a subtask level, significantly reducing costs while maintaining accuracy.

Contribution

The paper presents a novel framework with a reinforced router for fine-grained LLM collaboration, improving efficiency and scalability in reasoning tasks.

Findings

01

Reduces API costs by 84.46% compared to baselines.

02

Maintains competitive reasoning accuracy across six benchmarks.

03

Effectively coordinates heterogeneous models through supervised and reinforcement learning.

Abstract

Chain-of-thought has been proven essential for enhancing the complex reasoning abilities of Large Language Models (LLMs), but it also leads to high computational costs. Recent work has explored routing queries among multiple models and shown it to be a promising approach. However, previous works operate directly at the task level, i.e., assigning user queries to suitable LLMs, which does not allow hybrid LLMs to truly collaborate on finer-grained sub-tasks. Collaboration at the level of intermediate reasoning steps (thoughts) could enable more efficient coordination, but it also poses significant challenges for router scheduling, placing immense demands on the quality of task decomposition and the precision of the router. To address this, we propose R2-Reasoner, a novel framework centered around a Reinforced Model Router designed to efficiently scale LLM reasoning. This…
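
To make the task-level versus subtask-level distinction concrete, here is a minimal Python sketch. It is not the authors' method: R2-Reasoner learns its decomposition and allocation, whereas this toy replaces both with hand-set difficulty scores and a simple "cheapest model that is capable enough" rule. The model names, capability scores, per-call costs, and the helper names `route_subtasks` and `total_cost` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str             # hypothetical label, not a real model
    capability: float     # assumed: hardest difficulty it can handle, in [0, 1]
    cost_per_call: float  # assumed: arbitrary cost units per subtask


def route_subtasks(difficulties, models):
    """Assign each subtask to the cheapest model whose capability covers it.

    Falls back to the most capable model if none is capable enough.
    """
    by_cost = sorted(models, key=lambda m: m.cost_per_call)
    strongest = max(models, key=lambda m: m.capability)
    plan = []
    for d in difficulties:
        chosen = next((m for m in by_cost if m.capability >= d), strongest)
        plan.append(chosen)
    return plan


def total_cost(plan):
    """Sum the per-call cost of every assigned model."""
    return sum(m.cost_per_call for m in plan)


# Illustrative pool of heterogeneous models (made-up numbers).
MODELS = [
    Model("small", 0.3, 1.0),
    Model("medium", 0.6, 5.0),
    Model("large", 1.0, 20.0),
]

# Assumed difficulty of four subtasks produced by some decomposer.
difficulties = [0.2, 0.5, 0.9, 0.1]

plan = route_subtasks(difficulties, MODELS)
names = [m.name for m in plan]
subtask_cost = total_cost(plan)

# Toy task-level baseline: every step is handled by the strongest model.
strongest = max(MODELS, key=lambda m: m.capability)
task_level_cost = len(difficulties) * strongest.cost_per_call

print(names, subtask_cost, task_level_cost)
```

Under these made-up numbers the subtask-level plan costs 27 units against 80 for sending every step to the largest model. The figures illustrate the mechanism only and say nothing about the savings reported in the paper.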

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The framework demonstrates remarkable cost reduction (84.46%) while maintaining or even improving accuracy on several benchmarks, addressing a critical practical concern in deploying LLM reasoning systems at scale.
2. By operating at the subtask level rather than the task level, the approach enables more efficient utilization of model capabilities, matching computational resources to actual complexity requirements of individual reasoning steps.
3. The paper provides extensive experiments acro

Weaknesses

1. Figure 1 fails to effectively communicate the method's core logic, particularly lacking clear visual distinction between the Downward Allocation and Upward Allocation processes, making it difficult to understand the iterative refinement strategy at a glance.
2. The use of maximum token probability as a proxy for task difficulty raises concerns about validity. The paper lacks justification for the specific threshold values (τ_easy, τ_medium, τ_hard) and doesn't clearly explain how the baseline

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The shown results are solid. 2. Most claims are supported by results.

Weaknesses

1. The authors highlight the gain in reduced inference cost while keeping comparable (or better) performance on six reasoning tasks. However, training cost is not mentioned. As the training procedure leverages a strong model as reward to refine the weak model's reward, I assume this would incur a reasonably high cost.
2. The paper does not show the training process; it would be helpful to illustrate it with figures.
3. It would be more helpful to extend the ablation studies sec

Reviewer 03 · Rating 6 · Confidence 3

Strengths

**Originality** Proposes one of the first frameworks to enable subtask-level routing with an alternating RL training scheme that co-optimizes decomposition and allocation.

**Quality** Rigorous evaluation across six benchmarks (P3, SCAN, MATH, CHAMP, CSQA, MuSiQue) shows strong cost–accuracy trade-offs. Ablation studies support the value of the SFT+RL pipeline, with RL improving proxy measures of subtask quality and reducing allocation errors.

**Clarity & Significance** Well-structured and

Weaknesses

**Decomposition Evaluation** The reported 27% improvement in “subtask correctness” relies on automatic proxy metrics (e.g., coherence), not human-annotated gold standards, limiting confidence in the true reasoning fidelity.

**Synthetic Training Data** The decomposer is trained on synthetically generated CoT traces, which may not reflect the diversity or nuance of human reasoning, risking brittleness on out-of-distribution queries.

**Sparse Reward Signal** The binary final-answer reward prov

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- The framework brings together (a) subtask decomposition, (b) difficulty estimation/allocation, and (c) reinforcement fine‐tuning of the router. While decomposition and routing have been studied, combining them in a staged RL training pipeline is interesting. The “Model Router” concept is clearly described.
- The reported results (cost reduction ~86.85%) are impressive on the face of it. The authors show that their method achieves much lower API cost while maintaining accuracy in the benchmarks

Weaknesses

- What exactly is *Meval* — a baseline evaluation model or a practical deployment reference? The paper states that Meval is used to estimate *practicality*, but does not specify its configuration. Given that different reasoning models generate reasoning traces of varying lengths, the selection of *Meval* could directly influence the reported API cost of R2-Reasoner. Clarification is also needed on whether *Meval* is an edge or cloud-based model, since this has implications for both latency and

Taxonomy

Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling