Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

Wenwen Si; Insup Lee; Osbert Bastani

arXiv:2605.06116·cs.AI·May 8, 2026

TL;DR

This paper introduces a reinforcement learning method that routes intermediate chain-of-thought steps to language models of different sizes at inference time, improving the accuracy-cost tradeoff of LLM reasoning.

Contribution

It formulates stepwise model routing as a constrained decision-making problem and trains a small control policy with reinforcement learning, using threshold calibration to tune the performance-efficiency tradeoff without requiring a large process reward model.
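
One generic way to write such a constrained problem is sketched below. This is an illustrative form only, not the paper's own formulation: the symbols $\pi$ (routing policy), $s_t$ (intermediate CoT state), $a_t$ (model chosen at step $t$), $c(a_t)$ (per-step cost), $R$ (final-answer correctness), and $B$ (cost budget) are all assumptions introduced here.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ R \right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{T} c(a_t) \right] \le B,
\qquad a_t \sim \pi(\cdot \mid s_t).
```

Under this reading, the budget $B$ is the knob that trades accuracy against cost.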

Findings

01

Consistently improves the accuracy-cost tradeoff on three math benchmarks (GSM8K, MATH500, and OmniMath).

02

Achieves a tradeoff comparable to that of methods relying on large reward models.

03

Validates the method on both open and closed models.

Abstract

Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language models of different sizes; however, existing approaches rely on handcrafted routing strategies that limit performance, or on training large process reward models that may be infeasible in many applications. We formulate stepwise model routing as a constrained decision-making problem, which we solve by training a small control policy using reinforcement learning in conjunction with threshold calibration to tune the performance-efficiency tradeoff. We validate our method on three math benchmarks (GSM8K, MATH500, and OmniMath) on both open and closed models. Our method consistently improves the accuracy-cost tradeoff compared to…
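
To make the idea of stepwise routing with threshold calibration concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the per-step "escalation scores" stand in for the output of a hypothetical control policy, the function names (`route_steps`, `routing_cost`, `calibrate_threshold`) are invented for this example, and the per-step costs of 1 and 10 are arbitrary. The abstract does not describe the actual policy, reward, or calibration procedure.

```python
from typing import List, Sequence


def route_steps(scores: Sequence[float], threshold: float) -> List[str]:
    """Send a step to the large model when its (hypothetical) policy score
    exceeds the threshold; otherwise keep it on the small model."""
    return ["large" if s > threshold else "small" for s in scores]


def routing_cost(decisions: Sequence[str],
                 small_cost: float = 1.0,
                 large_cost: float = 10.0) -> float:
    """Total cost of one reasoning trace; the unit costs are made up."""
    return sum(large_cost if d == "large" else small_cost for d in decisions)


def calibrate_threshold(score_sets: Sequence[Sequence[float]],
                        budget: float,
                        candidates: Sequence[float]) -> float:
    """Return the lowest candidate threshold (i.e. the most large-model usage)
    whose mean cost on a held-out calibration set stays within the budget."""
    for t in sorted(candidates):
        mean_cost = sum(routing_cost(route_steps(s, t))
                        for s in score_sets) / len(score_sets)
        if mean_cost <= budget:
            return t
    # No candidate fits the budget: fall back to the most conservative one.
    return max(candidates)


# Two toy calibration traces, three reasoning steps each.
calibration_scores = [[0.9, 0.2, 0.6], [0.1, 0.3, 0.8]]
threshold = calibrate_threshold(calibration_scores, budget=12.0,
                                candidates=[0.0, 0.25, 0.5, 0.75, 1.0])
decisions = route_steps(calibration_scores[0], threshold)
```

In this sketch a lower threshold escalates more steps to the large model, so sweeping the threshold traces out an accuracy-cost curve; the calibration step simply picks the point on that curve that fits a given budget.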
