A Systematic Analysis of Base Model Choice for Reward Modeling
Kian Ahrabian, Pegah Jandaghi, Negar Mokhberian, Sai Praneeth Karimireddy, Jay Pujara

TL;DR
This paper systematically analyzes how the choice of base model affects reward modeling performance in RLHF, reporting gains of up to 14% over the default choice and showing how existing benchmark results can guide model selection.
Contribution
It provides a comprehensive analysis of base model effects on reward modeling, offering new methods to improve selection and performance prediction.
Findings
Reward modeling performance can improve by up to 14% over the most common (default) base model choice.
Combining results from a small set of benchmarks improves model selection by 18% on average in the top 5-10.
Post-training steps and data distribution estimates impact final reward model performance.
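The benchmark-combination finding can be illustrated with a minimal sketch. It assumes one simple way of combining benchmarks (summing per-benchmark z-scores) and one simple way of scoring a selection (top-k overlap with the true ranking); the paper's actual aggregation and metric are not stated on this page and may differ. All model names, benchmark names, and numbers below are invented placeholders, not results from the paper.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of scores; returns zeros if all scores are equal."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def combined_ranking(benchmark_scores):
    """Rank models (best first) by the sum of their per-benchmark z-scores.

    benchmark_scores maps benchmark name -> {model name -> score}.
    Summing z-scores is an assumed aggregation, chosen for simplicity.
    """
    models = sorted(next(iter(benchmark_scores.values())))
    totals = {m: 0.0 for m in models}
    for scores in benchmark_scores.values():
        standardized = zscores([scores[m] for m in models])
        for m, z in zip(models, standardized):
            totals[m] += z
    return sorted(models, key=lambda m: totals[m], reverse=True)


def top_k_overlap(predicted, actual, k):
    """Fraction of the true top-k models that the predicted top-k recovers."""
    return len(set(predicted[:k]) & set(actual[:k])) / k


# Hypothetical placeholder data (NOT from the paper).
benchmarks = {
    "bench_a": {"m1": 0.60, "m2": 0.70, "m3": 0.50, "m4": 0.80},
    "bench_b": {"m1": 0.55, "m2": 0.65, "m3": 0.40, "m4": 0.60},
}
# Hypothetical downstream reward-model accuracy per base model.
rm_accuracy = {"m1": 0.70, "m2": 0.78, "m3": 0.66, "m4": 0.75}

predicted = combined_ranking(benchmarks)
actual = sorted(rm_accuracy, key=rm_accuracy.get, reverse=True)
overlap_at_2 = top_k_overlap(predicted, actual, k=2)
```

In this toy setup the combined ranking recovers both of the two best base models even though it orders them differently, which is the kind of top-k agreement a selection metric would reward.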
Abstract
Reinforcement learning from human feedback (RLHF) and, at its core, reward modeling have become a crucial part of training powerful large language models (LLMs). One commonly overlooked factor in training high-quality reward models (RMs) is the effect of the base model, which is becoming more challenging to choose given the rapidly growing pool of LLMs. In this work, we present a systematic analysis of the effect of base model selection on reward modeling performance. Our results show that the performance can be improved by up to 14% compared to the most common (i.e., default) choice. Moreover, we showcase the strong statistical relation between some existing benchmarks and downstream performances. We also demonstrate that the results from a small set of benchmarks could be combined to boost the model selection (18% on average in the top 5-10). Lastly, we illustrate the impact of…
Taxonomy
Topics: Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI) · Topic Modeling
Methods: Sparse Evolutionary Training · Balanced Selection
