Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router
Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, Yong Li

TL;DR
R2-Reasoner introduces a reinforced model router that enables heterogeneous LLMs to collaborate on reasoning tasks at a subtask level, significantly reducing costs while maintaining accuracy.
Contribution
The paper presents a novel framework with a reinforced router for fine-grained LLM collaboration, improving efficiency and scalability in reasoning tasks.
Findings
Reduces API costs by 84.46% compared to baselines.
Maintains competitive reasoning accuracy across six benchmarks.
Effectively coordinates heterogeneous models through supervised and reinforcement learning.
Abstract
Chain-of-thought has been proven essential for enhancing the complex reasoning abilities of Large Language Models (LLMs), but it also leads to high computational costs. Recent advances have explored the method to route queries among multiple models and proved it as a promising approach. However, previous works directly operate at the task level, i.e., assigning user queries to suitable LLMs, which does not allow hybrid LLMs to truly collaborate on finer-grained sub-tasks. Collaboration at the level of intermediate reasoning steps (thoughts) could enable more efficient coordination, but it also poses significant challenges for router scheduling, placing immense demands on the quality of task decomposition and the precision of the router. To address this, we propose R2-Reasoner, a novel framework centered around a Reinforced Model Router designed to efficiently scale LLM reasoning. This…
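To make the contrast between task-level and subtask-level routing concrete, here is a minimal sketch. It is not the paper's router: R2-Reasoner learns its router with supervised and reinforcement learning, whereas this example swaps in a fixed "cheapest capable model" rule. The model names, capability scores, per-call costs, and difficulty values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str             # hypothetical model label
    capability: float     # hypothetical: hardest difficulty it handles, in [0, 1]
    cost_per_call: float  # hypothetical cost units per subtask call


def route_subtasks(difficulties, models):
    """Assign each subtask to the cheapest model whose capability covers it.

    A hand-written stand-in for a learned router; falls back to the
    strongest model when no model's capability reaches the difficulty.
    """
    by_cost = sorted(models, key=lambda m: m.cost_per_call)
    strongest = max(models, key=lambda m: m.capability)
    return [
        next((m for m in by_cost if m.capability >= d), strongest)
        for d in difficulties
    ]


def total_cost(plan):
    """Sum the per-call cost of every assigned model."""
    return sum(m.cost_per_call for m in plan)


MODELS = [
    Model("small-edge-model", 0.3, 1.0),
    Model("medium-model", 0.6, 5.0),
    Model("large-cloud-model", 1.0, 20.0),
]

# Hypothetical difficulty scores for four decomposed reasoning steps.
difficulties = [0.1, 0.5, 0.9, 0.2]

# Subtask-level routing: each step goes to the cheapest adequate model.
plan = route_subtasks(difficulties, MODELS)
subtask_level_cost = total_cost(plan)

# Task-level routing: the whole query goes to the single model needed
# for its hardest step, so every step is billed at that model's rate.
whole_query_model = route_subtasks([max(difficulties)], MODELS)
task_level_cost = total_cost(whole_query_model * len(difficulties))

print([m.name for m in plan], subtask_level_cost, task_level_cost)
```

Under these made-up numbers, only the hardest step is sent to the expensive model, which is where the saving over task-level routing comes from. Actual savings depend on how well the decomposition and difficulty estimates match reality.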
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The framework demonstrates remarkable cost reduction (84.46%) while maintaining or even improving accuracy on several benchmarks, addressing a critical practical concern in deploying LLM reasoning systems at scale.
2. By operating at the subtask level rather than the task level, the approach enables more efficient utilization of model capabilities, matching computational resources to actual complexity requirements of individual reasoning steps.
3. The paper provides extensive experiments acro
1. Figure 1 fails to effectively communicate the method's core logic, particularly lacking clear visual distinction between the Downward Allocation and Upward Allocation processes, making it difficult to understand the iterative refinement strategy at a glance.
2. The use of maximum token probability as a proxy for task difficulty raises concerns about validity. The paper lacks justification for the specific threshold values (τ_easy, τ_medium, τ_hard) and doesn't clearly explain how the baseline
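For readers unfamiliar with the difficulty proxy this review questions, the sketch below shows one way a maximum-token-probability score could be turned into difficulty buckets. Everything specific here is an assumption: the excerpt does not say how the paper aggregates token probabilities or how the three thresholds are used, so the averaging step, the default threshold values, the descending cut-off scheme, and the bucket names are hypothetical.

```python
def mean_max_token_probability(step_distributions):
    """Average, over generated tokens, of the top probability at each step.

    `step_distributions` is a list of per-token probability lists.
    Averaging is an assumption made for this sketch.
    """
    if not step_distributions:
        raise ValueError("need at least one token distribution")
    return sum(max(dist) for dist in step_distributions) / len(step_distributions)


def difficulty_bucket(confidence, tau_easy=0.9, tau_medium=0.7, tau_hard=0.5):
    """Map a confidence score to a difficulty label.

    The thresholds are treated as descending cut-offs with hypothetical
    default values; higher confidence is read as an easier subtask.
    """
    if confidence >= tau_easy:
        return "easy"
    if confidence >= tau_medium:
        return "medium"
    if confidence >= tau_hard:
        return "hard"
    return "very hard"


# Two generated tokens, each with a toy two-way probability distribution.
confidence = mean_max_token_probability([[0.9, 0.1], [0.8, 0.2]])
print(difficulty_bucket(confidence))
```

The sketch also makes the reviewer's concern visible: the resulting label depends entirely on where the cut-offs are placed, so unexplained threshold values are hard to assess.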
1. The shown results are solid.
2. Most claims are supported by results.
1. The authors highlight the gain in reducing inference cost while keeping comparable (or better) performance on six reasoning tasks. However, training cost is not mentioned. As the training procedure leverages a strong model as reward to refine the weak model's reward, I assume this would incur a reasonably high cost.
2. The paper does not show the training process; it would be helpful to illustrate it with figures.
3. It would be more helpful to extend the ablation studies sec
**Originality** Proposes one of the first frameworks to enable subtask-level routing with an alternating RL training scheme that co-optimizes decomposition and allocation.
**Quality** Rigorous evaluation across six benchmarks (P3, SCAN, MATH, CHAMP, CSQA, MuSiQue) shows strong cost–accuracy trade-offs. Ablation studies support the value of the SFT+RL pipeline, with RL improving proxy measures of subtask quality and reducing allocation errors.
**Clarity & Significance** Well-structured and
**Decomposition Evaluation** The reported 27% improvement in “subtask correctness” relies on automatic proxy metrics (e.g., coherence), not human-annotated gold standards, limiting confidence in the true reasoning fidelity.
**Synthetic Training Data** The decomposer is trained on synthetically generated CoT traces, which may not reflect the diversity or nuance of human reasoning, risking brittleness on out-of-distribution queries.
**Sparse Reward Signal** The binary final-answer reward prov
- The framework brings together (a) subtask decomposition, (b) difficulty estimation/allocation, and (c) reinforcement fine-tuning of the router. While decomposition and routing have been studied, combining them in a staged RL training pipeline is interesting. The “Model Router” concept is clearly described.
- The reported results (cost reduction ~86.85%) are impressive on the face of it. The authors show that their method achieves much lower API cost while maintaining accuracy in the benchmarks
- What exactly is *Meval* — a baseline evaluation model or a practical deployment reference? The paper states that Meval is used to estimate *practicality*, but does not specify its configuration. Given that different reasoning models generate reasoning traces of varying lengths, the selection of *Meval* could directly influence the reported API cost of R2-Reasoner. Clarification is also needed on whether *Meval* is an edge or cloud-based model, since this has implications for both latency and
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling
