Cost-Effective Online Multi-LLM Selection with Versatile Reward Models

Xiangxiang Dai; Jin Li; Xutong Liu; Anqi Yu; John C.S. Lui

arXiv:2405.16587·cs.LG·October 3, 2024·1 cites

Open Access · 3 Reviews

TL;DR

This paper introduces C2MAB-V, an online algorithm for cost-effective multi-LLM selection that balances reward and cost across diverse tasks using a combinatorial bandit approach with theoretical guarantees.

Contribution

The paper presents a novel online multi-armed bandit framework, C2MAB-V, for selecting multiple LLMs efficiently considering cost and reward, with proven theoretical guarantees and empirical validation.

Findings

01 · C2MAB-V balances performance and cost across nine LLMs.

02 · Theoretical guarantees match state-of-the-art regret bounds.

03 · Empirical results demonstrate effectiveness in three application scenarios.

Abstract

With the rapid advancement of large language models (LLMs), the diversity of multi-LLM tasks and the variability in their pricing structures have become increasingly important, as costs can vary greatly between different LLMs. To tackle these challenges, we introduce C2MAB-V, a Cost-effective Combinatorial Multi-armed Bandit with Versatile reward models for optimal LLM selection and usage. This online model differs from traditional static approaches or those reliant on a single LLM without cost consideration. With multiple LLMs deployed on a scheduling cloud and a local server dedicated to handling user queries, C2MAB-V facilitates the selection of multiple LLMs over a combinatorial search space, specifically tailored for various collaborative task types with different reward models. Based on our designed…
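To make the setting in the abstract concrete, here is a minimal Python sketch of a cost-aware combinatorial bandit. It is not the paper's C2MAB-V algorithm: the class name `CostAwareCombinatorialUCB`, the confidence radius, the additive reward, the greedy reward-per-cost selection, and all numbers in `simulate` are assumptions chosen for illustration. It shows only the general idea of picking several LLMs per round using optimistic reward estimates and optimistic (low) cost estimates under a budget.

```python
import math
import random


class CostAwareCombinatorialUCB:
    """Toy cost-aware combinatorial bandit over hypothetical LLM arms.

    Illustrative assumption only; not the C2MAB-V procedure from the paper.
    """

    def __init__(self, num_arms, budget, max_selected):
        self.num_arms = num_arms
        self.budget = budget              # per-round cost budget (assumed)
        self.max_selected = max_selected  # max LLMs chosen per round (assumed)
        self.counts = [0] * num_arms
        self.reward_means = [0.0] * num_arms
        self.cost_means = [0.0] * num_arms
        self.round = 0

    def _radius(self, arm):
        # Standard UCB-style confidence radius; unseen arms get infinity.
        if self.counts[arm] == 0:
            return float("inf")
        return math.sqrt(1.5 * math.log(self.round + 1) / self.counts[arm])

    def select(self):
        """Greedily pick arms by optimistic reward per optimistic cost."""
        self.round += 1
        scored = []
        for arm in range(self.num_arms):
            radius = self._radius(arm)
            ucb_reward = min(1.0, self.reward_means[arm] + radius)
            lcb_cost = max(0.0, self.cost_means[arm] - radius)
            scored.append((arm, ucb_reward, lcb_cost))
        scored.sort(key=lambda t: t[1] / (t[2] + 1e-9), reverse=True)
        chosen, spent = [], 0.0
        for arm, _, lcb_cost in scored:
            fits = spent + lcb_cost <= self.budget
            if len(chosen) < self.max_selected and fits:
                chosen.append(arm)
                spent += lcb_cost
        return chosen

    def update(self, arm, reward, cost):
        """Incrementally update the empirical reward and cost means."""
        self.counts[arm] += 1
        n = self.counts[arm]
        self.reward_means[arm] += (reward - self.reward_means[arm]) / n
        self.cost_means[arm] += (cost - self.cost_means[arm]) / n


def simulate(rounds=2000, seed=0):
    """Run the toy bandit against made-up reward and cost distributions."""
    rng = random.Random(seed)
    true_reward = [0.9, 0.6, 0.5, 0.3]  # hypothetical success rates
    true_cost = [0.8, 0.3, 0.2, 0.1]    # hypothetical normalized prices
    bandit = CostAwareCombinatorialUCB(num_arms=4, budget=0.6, max_selected=2)
    for _ in range(rounds):
        for arm in bandit.select():
            reward = 1.0 if rng.random() < true_reward[arm] else 0.0
            cost = min(1.0, max(0.0, rng.gauss(true_cost[arm], 0.05)))
            bandit.update(arm, reward, cost)
    return bandit
```

Calling `simulate()` returns the bandit, whose `counts` list shows how often each hypothetical LLM was selected. The greedy ratio rule is a simple stand-in for the combinatorial optimization step; a different reward model (for example, one where a single successful LLM suffices) would change how a selected set is scored.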

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 2

Strengths

I think the core explore-exploit algorithm that was introduced is new (to this reviewer), interesting, and also comes with potentially new regret guarantees.

Weaknesses

I have some concerns about the practical deployability of the approach.

Reviewer 02 · Rating 5 · Confidence 3

Strengths

This paper has the following strengths:
+ This paper considers an interesting problem: selecting LLMs deployed on a cloud.
+ This paper provides good formulations of the online LLM selection problem. Different reward models are considered, and the budget constraint of the scheduling cloud is also modeled.
+ Performance analysis is provided for the combinatorial bandit problem.

Weaknesses

The paper can be improved in the following aspects.
- This paper considers the LLM selection problem. However, the rewards are modeled as simple equations on page 5 and look like toy models. These models cannot accurately reflect the real evaluation of combinatorial LLMs.
- The algorithm design is a simple extension of combinatorial bandits. The analysis of the algorithm also looks similar to previous results on combinatorial bandits. It would be better if the authors can present the

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The formulation of the problem as a cost-effective combinatorial multi-armed bandit with versatile reward models is interesting and applicable in real-world scenarios where cost-efficiency is crucial. The paper reads well and includes examples/intuition to help the reader.
2. The authors provide a rigorous theoretical analysis of the regret and budget violations of the proposed algorithm, achieving results that match the state-of-the-art in several degenerate cases. Plus, the case-by-case a

Weaknesses

1. The algorithms and theoretical analysis presented are natural extensions of previous approaches, specifically in the combinatorial multi-armed bandit literature. Also, I do not believe that non-linear rewards pose a significant challenge for the analysis, as the paper assumes that the reward function is Monotone and Lipschitz. Therefore, the regret analysis can be converted to the accumulated impact of overestimating rewards, just as most of the literature does. Hence, I cannot give a score h

Taxonomy

Topics · Digital Rights Management and Security