LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

TL;DR
LASeR introduces a multi-armed bandit approach to adaptively select reward models during LLM training, enhancing performance and efficiency across various tasks by dynamically choosing the most suitable reward model for each instance.
Contribution
This paper presents LASeR, a novel method that frames reward model selection as a multi-armed bandit problem, enabling adaptive and efficient training of LLMs with multiple RMs.
Findings
Improves the absolute average accuracy of Llama-3-8B by 2.67% across three datasets.
Achieves a 2x speedup in training efficiency.
Attains a 72.69% win rate on WildChat tasks.
Abstract
Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g. writing) generalizes to new tasks (e.g. math) is often not known a priori, which makes training LLMs with a single fixed RM suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost and lead to conflicting signals from different RMs that may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem, efficiently and iteratively training LLMs using multiple RMs by selecting the most well-suited RM for each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three…
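To make the bandit framing concrete, here is a minimal sketch of per-instance RM selection with a contextual bandit. It uses a standard disjoint LinUCB (the algorithm named in the reviews), written in pure Python with Sherman-Morrison updates. Everything specific to the example is an assumption for illustration and not the authors' implementation: the class `LinUCBSelector`, the helpers `toy_reward` and `run_toy`, the one-hot task contexts, and the 0/1 reward that stands in for the real training signal.

```python
import math


class LinUCBSelector:
    """Disjoint LinUCB over a fixed set of arms (here: hypothetical reward models)."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength (assumed value)
        self.dim = dim
        # Per arm: A_inv is the inverse of A = I + sum(x x^T); b = sum(reward * x).
        self.A_inv = [[[float(i == j) for j in range(dim)] for i in range(dim)]
                      for _ in range(n_arms)]
        self.b = [[0.0] * dim for _ in range(n_arms)]

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    @staticmethod
    def _dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def scores(self, x):
        """Upper confidence bound of each arm for context vector x."""
        out = []
        for A_inv, b in zip(self.A_inv, self.b):
            theta = self._matvec(A_inv, b)  # ridge-regression estimate
            Ax = self._matvec(A_inv, x)
            bonus = self.alpha * math.sqrt(max(self._dot(x, Ax), 0.0))
            out.append(self._dot(theta, x) + bonus)
        return out

    def select(self, x):
        s = self.scores(x)
        return max(range(len(s)), key=s.__getitem__)  # ties go to the lowest index

    def update(self, arm, x, reward):
        # Sherman-Morrison rank-one update of A_inv after adding x x^T to A.
        A_inv = self.A_inv[arm]
        Ax = self._matvec(A_inv, x)  # A_inv is symmetric
        denom = 1.0 + self._dot(x, Ax)
        for i in range(self.dim):
            for j in range(self.dim):
                A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(self.dim):
            self.b[arm][i] += reward * x[i]


def toy_reward(arm, x):
    # Hypothetical stand-in for the training signal:
    # arm 0 suits context [1, 0], arm 1 suits context [0, 1].
    return 1.0 if x[arm] == 1.0 else 0.0


def run_toy(steps=50):
    selector = LinUCBSelector(n_arms=2, dim=2, alpha=1.0)
    contexts = [[1.0, 0.0], [0.0, 1.0]]  # e.g. "math" vs "writing" (assumed)
    for t in range(steps):
        x = contexts[t % 2]
        arm = selector.select(x)  # pick one RM for this instance
        selector.update(arm, x, toy_reward(arm, x))
    return selector


if __name__ == "__main__":
    sel = run_toy()
    print(sel.select([1.0, 0.0]), sel.select([0.0, 1.0]))
```

In this toy setup the selector settles on a different arm for each context after a few rounds, which is the behavior the bandit framing is meant to provide: only one RM is queried per instance, so the cost does not grow with the number of RMs as it would for an ensemble.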
Peer Reviews
Decision·Submitted to ICLR 2025
- The results are strong. They show improvements over their chosen baselines.
- The paper is written clearly, and is highly readable. All the points are well covered.
- There is no baseline that selects more than one reward model over an epoch of training. The baselines are rather weak. What about a simple baseline that learns a classifier to choose the reward model based on the type of query? If bandit algorithms perform better than such an approach, we can conclude that the method of using covariance works. Without any such baseline, we are left with the conclusion that one should choose the RM depending on the input c(t). Further to this, why use the authors
1. The proposed LASeR iteratively trains LLMs using different RMs by dynamically selecting the most appropriate one for each training instance using a contextual bandit algorithm, specifically LinUCB, which effectively addresses the potential inefficiencies and conflicts present in ensemble methods that handle multiple RMs simultaneously.
2. The paper thoroughly evaluates LASeR across various datasets and tasks, showcasing its superior performance over baselines.
1. Fairness of Comparison: In Section 4, LASeR is compared with baselines that do not actively leverage information from interactions with the datasets. The RM selection in these baselines is either fully random or done in an offline fashion. Specifically, in Table 4, the "best RM" baseline (Zephyr-7b-alpha) is not the top performer on StrategyQA and MMLU, which contradicts claims in Appendix B. Additionally, the sequential RM selection shows better performance than other baselines across most datasets,
1. Innovative approach to automatically selecting Reward Models using a bandit algorithm. The MAB framework allows the model to dynamically choose the most suitable RM at the instance level based on contextual information (the embedding of the last token). This avoids the computational burden of previous approaches that are based on RM ensembles.
2. Empirical results presented in the paper indicate that LASeR achieves performance improvements over traditional methods, such as ensemble RM scores. LASeR achi
A notable weakness of the paper is the choice to set the reward in the MAB problem to the negative training loss. While maximizing the reward is equivalent to minimizing the training loss, this method raises concerns about the alignment of selected preference pairs with the language model. Specifically, if the MAB algorithm consistently selects RMs that align with the LM (meaning that the preference pairs generated by the RM have low loss under the LM), it essentially just reinforces the LM's existi
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Bandit Algorithms Research · Machine Learning in Healthcare
