TL;DR
This paper introduces a novel online framework using contextual bandits to adaptively select the most suitable large language model for user queries, effectively handling unstructured prompt evolution without offline data.
Contribution
It presents the first contextual bandit approach for sequential LLM selection with unstructured prompt dynamics, including theoretical guarantees and practical extensions for cost and user preference considerations.
Findings
Outperforms existing LLM routing strategies in accuracy
Achieves lower costs across diverse benchmarks
Provably achieves sublinear myopic regret
Abstract
Large language models (LLMs) exhibit diverse response behaviors, costs, and strengths, making it challenging to select the most suitable LLM for a given user query. We study the problem of adaptive multi-LLM selection in an online setting, where the learner interacts with users through multi-step query refinement and must choose LLMs sequentially without access to offline datasets or model internals. A key challenge arises from unstructured context evolution: the prompt dynamically changes in response to previous model outputs via a black-box process, which cannot be simulated, modeled, or learned. To address this, we propose the first contextual bandit framework for sequential LLM selection under unstructured prompt dynamics. We formalize a notion of myopic regret and develop a LinUCB-based algorithm that provably achieves sublinear regret without relying on future context prediction.…
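To make the LinUCB idea concrete, here is a minimal sketch of the standard disjoint LinUCB rule applied to model selection. It is not the paper's exact algorithm. It assumes each prompt has already been mapped to a fixed-length feature vector `x`, and every name in it (`LinUCBArm`, `select_model`, `simulate`, the synthetic reward weights) is hypothetical. The learner scores each model using only the current context, which matches the myopic setting described above: it never tries to predict how the prompt will change next.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


class LinUCBArm:
    """Ridge-regression state for one candidate LLM (disjoint LinUCB)."""

    def __init__(self, dim, ridge=1.0):
        # A_inv is the inverse of (ridge * I + sum of x x^T); b is the sum of reward * x.
        self.A_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def ucb(self, x, alpha):
        # Estimated reward plus an exploration bonus for the current context only.
        a_x = mat_vec(self.A_inv, x)
        theta = mat_vec(self.A_inv, self.b)
        return dot(theta, x) + alpha * math.sqrt(max(dot(x, a_x), 0.0))

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A_inv (A_inv stays symmetric).
        a_x = mat_vec(self.A_inv, x)
        denom = 1.0 + dot(x, a_x)
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= a_x[i] * a_x[j] / denom
            self.b[i] += reward * x[i]


def select_model(arms, x, alpha=1.0):
    """Return the index of the model with the highest upper confidence bound."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return scores.index(max(scores))


def simulate(steps=200, seed=0, alpha=1.0):
    """Toy interaction loop; the reward model and context update are made up."""
    rng = random.Random(seed)
    # Hypothetical per-model weights, used only to generate synthetic rewards.
    true_theta = [[0.9, 0.1], [0.2, 0.8]]
    arms = [LinUCBArm(dim=2) for _ in true_theta]
    x = [1.0, 0.0]
    choices = []
    for _ in range(steps):
        k = select_model(arms, x, alpha)
        reward = dot(true_theta[k], x) + rng.gauss(0.0, 0.05)
        arms[k].update(x, reward)
        choices.append(k)
        # Stand-in for the black-box prompt evolution; the learner never models it.
        p = rng.random()
        x = [p, 1.0 - p]
    return choices
```

Calling `simulate(steps=200, seed=0)` returns the sequence of chosen model indices. In a real deployment the synthetic reward would be replaced by observed feedback on the response, and a cost-aware variant could subtract a per-model cost term from the score before taking the maximum.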