Reward-Based Online LLM Routing via NeuralUCB

Ming-Hua Tsai; Phat Tran

arXiv:2603.30035 · cs.LG · April 1, 2026

TL;DR

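This paper applies NeuralUCB to cost-aware LLM routing and evaluates it on RouterBench in a simulated online setting. The policy beats random and min-cost baselines on utility reward and costs substantially less than the max-quality reference while keeping reward competitive. Action discrimination and exploration remain open challenges.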

Contribution

It implements a NeuralUCB-based routing policy for LLMs and shows that it outperforms random and min-cost baselines in utility reward, at substantially lower inference cost than the max-quality reference.

Findings

01. NeuralUCB consistently outperforms the random and min-cost baselines in utility reward.

02. Compared with the max-quality reference, the method achieves substantially lower inference cost while maintaining competitive reward.

03. Action discrimination and exploration remain open challenges.

Abstract

This study investigates the use of NeuralUCB for cost-aware large language model (LLM) routing. Existing routing approaches can be broadly grouped into supervised routing methods and partial-feedback methods, each with different tradeoffs in efficiency and adaptivity. We implement a NeuralUCB-based routing policy and evaluate it on RouterBench under a simulated online setting. Experimental results show that the proposed method consistently outperforms random and min-cost baselines in utility reward. Compared with the max-quality reference, our method achieves substantially lower inference cost while maintaining competitive reward. These findings suggest that NeuralUCB is a promising approach for cost-aware LLM routing, while also highlighting remaining challenges in action discrimination and exploration.
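
To make the routing loop concrete, here is a minimal sketch of a NeuralUCB-style router under partial feedback, written with NumPy only. It is an illustration under stated assumptions, not the authors' implementation: the class `TinyNeuralUCB`, the shared one-hidden-layer network over (context, one-hot model id) inputs, the single SGD step per round, the utility form `quality - lam * cost`, and the synthetic data in `simulate` are all hypothetical choices. The abstract does not specify the reward definition or the network, and RouterBench is not used here.

```python
import numpy as np


def utility(quality, cost, lam):
    """Assumed cost-aware reward: quality minus a cost penalty weighted by lam."""
    return quality - lam * cost


class TinyNeuralUCB:
    """One-hidden-layer NeuralUCB-style sketch with a network shared across arms."""

    def __init__(self, context_dim, n_arms, hidden=16, gamma=1.0, reg=1.0,
                 lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.n_arms, self.hidden = n_arms, hidden
        self.gamma, self.lr = gamma, lr
        in_dim = context_dim + n_arms  # context concatenated with one-hot arm id
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(hidden, in_dim))
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=hidden)
        n_params = self.W1.size + self.w2.size
        self.Z_inv = np.eye(n_params) / reg  # inverse of Z = reg * I initially

    def _features(self, context, arm):
        one_hot = np.zeros(self.n_arms)
        one_hot[arm] = 1.0
        return np.concatenate([context, one_hot])

    def _forward_and_grad(self, x):
        pre = self.W1 @ x
        h = np.maximum(pre, 0.0)                    # ReLU hidden layer
        pred = float(self.w2 @ h)
        grad_W1 = np.outer(self.w2 * (pre > 0), x)  # d pred / d W1
        grad_w2 = h                                 # d pred / d w2
        return pred, grad_W1, grad_w2

    def select(self, context):
        """Pick the arm with the largest predicted reward plus exploration bonus."""
        scores = []
        for arm in range(self.n_arms):
            pred, gW1, gw2 = self._forward_and_grad(self._features(context, arm))
            g = np.concatenate([gW1.ravel(), gw2])
            width = max(float(g @ self.Z_inv @ g) / self.hidden, 0.0)
            scores.append(pred + self.gamma * np.sqrt(width))
        return int(np.argmax(scores))

    def update(self, context, arm, reward):
        """Use only the chosen arm's reward (partial feedback)."""
        pred, gW1, gw2 = self._forward_and_grad(self._features(context, arm))
        g = np.concatenate([gW1.ravel(), gw2]) / np.sqrt(self.hidden)
        # Sherman-Morrison update of Z_inv for Z <- Z + g g^T
        Zg = self.Z_inv @ g
        self.Z_inv -= np.outer(Zg, Zg) / (1.0 + g @ Zg)
        # Simplification: one SGD step on squared error instead of full retraining
        err = pred - reward
        self.W1 -= self.lr * err * gW1
        self.w2 -= self.lr * err * gw2


def simulate(n_rounds=200, lam=0.5, seed=0):
    """Synthetic online loop; costs and 'skill' vectors are made up for illustration."""
    rng = np.random.default_rng(seed)
    costs = np.array([0.1, 0.5, 1.0])   # hypothetical per-query model costs
    skill = rng.normal(size=(3, 4))     # hypothetical model/context affinity
    router = TinyNeuralUCB(context_dim=4, n_arms=3, seed=seed)
    total = 0.0
    for _ in range(n_rounds):
        context = rng.normal(size=4)
        arm = router.select(context)
        quality = 1.0 / (1.0 + np.exp(-skill[arm] @ context))
        reward = utility(quality, costs[arm], lam)
        router.update(context, arm, reward)
        total += reward
    return total / n_rounds
```

Calling `simulate()` returns the average utility collected by the router on the synthetic stream. The exploration bonus shrinks along gradient directions that have already been observed, which is the mechanism the abstract's "exploration" challenge refers to. Sharing one network across models is just one way to approach action discrimination.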
