TL;DR
This paper introduces online mixture-of-experts algorithms that dynamically aggregate expert outputs for optimal decision-making, providing theoretical regret guarantees and demonstrating effectiveness in fine-tuning large language models.
Contribution
The paper proposes two novel expert-aggregation algorithms with regret guarantees and applies them to the online fine-tuning of large language models.
Findings
Both algorithms achieve low regret in the bandit setting.
Empirical results show improved accuracy in LLM fine-tuning.
Theoretical analysis establishes the regret bounds under ideal circumstances.
Abstract
We explore expert-guided bandit learning, which we refer to as online mixture-of-experts (OMoE). In this setting, given a context, a candidate committee of experts must determine how to aggregate their outputs to maximize aggregate accuracy. We propose two algorithms to address this problem. The first combines aggregate voting with UCB-driven successive elimination, efficiently pruning suboptimal exploration actions. The second employs an online weighted-majority-voting mechanism that gives each expert voting power proportional to its predictive power. We derive theoretical regret guarantees in the bandit setting under ideal circumstances and provide supporting empirical results. As a contemporary application, these methods are applied to the online fine-tuning of a set of expert…
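To make the weighted-majority idea concrete, here is a minimal sketch of the classic textbook Weighted Majority algorithm for binary labels. It is not the paper's algorithm: it assumes full feedback (the true label is revealed every round) rather than bandit feedback, and the names `WeightedMajority`, `beta`, and `run_demo`, as well as the toy data stream, are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


class WeightedMajority:
    """Classic deterministic Weighted Majority over binary-label experts."""

    def __init__(self, n_experts: int, beta: float = 0.5) -> None:
        # beta is an assumed penalty factor applied to experts that err.
        if not 0.0 < beta < 1.0:
            raise ValueError("beta must lie strictly between 0 and 1")
        self.beta = beta
        self.weights: List[float] = [1.0] * n_experts

    def predict(self, votes: Sequence[int]) -> int:
        # Sum the weight behind each label; ties go to label 1.
        mass_one = sum(w for w, v in zip(self.weights, votes) if v == 1)
        mass_zero = sum(w for w, v in zip(self.weights, votes) if v == 0)
        return 1 if mass_one >= mass_zero else 0

    def update(self, votes: Sequence[int], truth: int) -> None:
        # Shrink the weight of every expert that voted incorrectly.
        self.weights = [
            w * self.beta if v != truth else w
            for w, v in zip(self.weights, votes)
        ]


def run_demo() -> Tuple[List[float], int]:
    # Hypothetical toy stream: expert 0 is always right,
    # experts 1 and 2 are always wrong.
    truths = [1, 0, 1, 1, 0]
    wm = WeightedMajority(n_experts=3, beta=0.5)
    mistakes = 0
    for y in truths:
        votes = [y, 1 - y, 1 - y]
        if wm.predict(votes) != y:
            mistakes += 1
        wm.update(votes, y)
    return wm.weights, mistakes


if __name__ == "__main__":
    print(run_demo())
```

In this toy run the committee is initially outvoted by the two unreliable experts, but their weights halve after each error, so the reliable expert soon dominates the vote. A bandit variant would have to estimate each expert's accuracy from partial feedback, which this sketch does not attempt.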