A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

Xiangxiang Dai; Yuejin Xie; Maoli Liu; Xuchuang Wang; Zhuohua Li; Huanyu Wang; John C.S. Lui

arXiv:2501.01849 · cs.HC · November 12, 2025

TL;DR

This paper presents MACO, a multi-agent online learning framework for dynamically evaluating and selecting user-aligned responses from large language models, improving efficiency and response quality in diverse conversational settings.

Contribution

Introduces MACO, a novel multi-agent online learning approach with local elimination and adaptive strategies for efficient, user-aligned LLM response selection.

Findings

01. MACO achieves near-optimal regret bounds.

02. Outperforms baseline methods by at least 8.29%.

03. Effective across various response styles and model types.

Abstract

Prompt-based offline methods are commonly used to optimize large language model (LLM) responses, but evaluating these responses is computationally intensive and often fails to accommodate diverse response styles. This study introduces a novel online evaluation framework that employs a multi-agent conversational bandit model to select optimal responses while dynamically aligning with user preferences. To tackle challenges such as high-dimensional features, large response sets, adaptive conversational needs, and multi-device access, we propose MACO (Multi-Agent Conversational Online Learning), which comprises two key components: (1) MACO-A: executed by local agents, it employs an online elimination mechanism to filter out low-quality responses. (2) MACO-S: executed by the cloud server, it adaptively adjusts selection strategies based on aggregated preference data. An…
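The agent/server split described in the abstract can be illustrated with a minimal sketch. This is not the paper's MACO-A or MACO-S: it is a generic elimination-style linear bandit under assumed linear preferences, and it omits the conversational (key-term) component entirely. All names (`server_aggregate`, `local_eliminate`, `simulate`), the confidence `width`, and the noise level are hypothetical choices made for illustration.

```python
import numpy as np

def server_aggregate(local_stats, dim, reg=1.0):
    """Hypothetical server step: pool each agent's (A, b) statistics and
    solve a ridge regression for a shared preference vector."""
    A = reg * np.eye(dim)
    b = np.zeros(dim)
    for A_m, b_m in local_stats:
        A += A_m
        b += b_m
    theta_hat = np.linalg.solve(A, b)
    return theta_hat, A

def local_eliminate(features, active, theta_hat, A, width=1.0):
    """Hypothetical agent step: drop responses whose optimistic score is
    below the best pessimistic score among the still-active responses."""
    A_inv = np.linalg.inv(A)
    idx = sorted(active)
    X = features[idx]
    mean = X @ theta_hat
    # Confidence bonus width * sqrt(x^T A^{-1} x) for each active response.
    bonus = width * np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))
    best_lower = np.max(mean - bonus)
    return {i for i, m, c in zip(idx, mean, bonus) if m + c >= best_lower}

def simulate(num_agents=3, num_responses=20, dim=5, rounds=30, seed=0):
    """Toy loop: agents collect noisy feedback, the server aggregates,
    and each agent then prunes its candidate response set."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=dim)
    theta /= np.linalg.norm(theta)  # assumed (unknown) user preference
    features = rng.normal(size=(num_responses, dim))
    features /= np.linalg.norm(features, axis=1, keepdims=True)
    active = [set(range(num_responses)) for _ in range(num_agents)]
    stats = [(np.zeros((dim, dim)), np.zeros(dim)) for _ in range(num_agents)]
    for _ in range(rounds):
        for m in range(num_agents):
            A_m, b_m = stats[m]
            for i in active[m]:  # try every surviving response once
                x = features[i]
                reward = x @ theta + 0.1 * rng.normal()  # simulated feedback
                A_m = A_m + np.outer(x, x)
                b_m = b_m + reward * x
            stats[m] = (A_m, b_m)
        theta_hat, A = server_aggregate(stats, dim)
        active = [local_eliminate(features, s, theta_hat, A) for s in active]
    return features, theta, active
```

Calling `simulate()` returns the response features, the simulated preference vector, and each agent's surviving response set. In this sketch a candidate set can only shrink, and it never becomes empty, because the response with the highest lower bound always passes its own test.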

Code & Models

Repositories

tarfersoul/maco (Official)

Videos

A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

Taxonomy

Topics: Educational Technology and Assessment

Methods: ADaptive gradient method with the OPTimal convergence rate