Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, Tong Zhang

TL;DR
This paper introduces an interpretable reward model for language models using multi-objective absolute ratings and a mixture-of-experts approach, improving alignment and transparency in RLHF.
Contribution
It proposes a novel two-stage method combining multi-objective absolute ratings with a mixture-of-experts gating network for interpretable reward modeling.
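The combination described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the objective names, the feature vector, the gate matrix, and all numeric values are hypothetical, and a single linear layer followed by a softmax stands in for the gating network. Stage one is assumed to have already produced one absolute rating per objective; stage two turns prompt-dependent gating weights into a single scalar reward.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gating_weights(prompt_features, gate_matrix):
    """Map prompt features to one non-negative weight per objective.

    gate_matrix has one row per objective (hypothetical parameters).
    """
    logits = [
        sum(w * x for w, x in zip(row, prompt_features))
        for row in gate_matrix
    ]
    return softmax(logits)


def scalar_reward(objective_ratings, weights):
    """Weighted sum of per-objective ratings."""
    return sum(w * r for w, r in zip(weights, objective_ratings))


# Hypothetical values, for illustration only.
objectives = ["helpfulness", "correctness", "verbosity"]
ratings = [0.8, 0.6, 0.3]            # assumed stage-one absolute ratings
prompt_features = [1.0, 0.5]         # assumed prompt representation
gate_matrix = [[0.2, 0.1], [0.4, -0.3], [-0.5, 0.2]]

weights = gating_weights(prompt_features, gate_matrix)
reward = scalar_reward(ratings, weights)

# The per-objective breakdown is what makes the final score inspectable.
for name, w, r in zip(objectives, weights, ratings):
    print(f"{name}: weight={w:.3f} rating={r:.2f}")
print(f"scalar reward={reward:.3f}")
```

Because the weights are non-negative and sum to one, the scalar reward always lies between the lowest and highest objective rating, and each objective's contribution can be read off directly.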
Findings
Achieved state-of-the-art performance on RewardBench.
Surpassed GPT-4 judges in reward evaluation accuracy.
Approached the performance of larger reward models with fewer parameters.
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability, as humans cannot intuitively understand why an RM thinks a response is good or not. As RMs act as human preference proxies, we believe they should be human-interpretable to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage…
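The conventional pairwise setup mentioned in the abstract can be illustrated with a small sketch. The abstract does not name a specific loss, so this example assumes the Bradley-Terry formulation, a common choice for pairwise preference data; the function name and inputs are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected)).

    Assumed formulation, not taken from the paper. Written in a
    numerically stable form so large margins do not overflow.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred response scores higher,
# and large when the ranking is inverted.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```

Note that this loss depends only on the difference between the two scalar rewards, which is one reason a single pairwise-trained score gives no direct account of why one response was preferred.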
Code & Models
- 🤗 RLHFlow/ArmoRM-Llama3-8B-v0.1 · model · 18k dl · ♡ 183
- 🤗 Magpie-Align/Llama-3-8B-Magpie-Align-v0.1 · model · 21 dl · ♡ 10
- 🤗 Magpie-Align/Llama-3-8B-Magpie-Align-v0.2 · model · 27 dl · ♡ 1
- 🤗 Magpie-Align/Llama-3-8B-Magpie-Align-v0.3 · model · 7.6k dl · ♡ 3
- 🤗 princeton-nlp/gemma-2-9b-it-SimPO · model · 397 dl · ♡ 172
- 🤗 princeton-nlp/gemma-2-9b-it-DPO · model · 31 dl · ♡ 9
- 🤗 QuantFactory/gemma-2-9b-it-DPO-GGUF · model · 89 dl · ♡ 3
- 🤗 QuantFactory/gemma-2-9b-it-SimPO-GGUF · model · 61 dl · ♡ 2
- 🤗 QuantFactory/gemma-2-9b-it-SimPO-GGUF-v2 · model · 142 dl · ♡ 3
- 🤗 Magpie-Align/Llama-3.1-8B-Magpie-Align-v0.1 · model · 22 dl · ♡ 4
- princeton-nlp/llama3-ultrafeedback-armorm · dataset · 498 dl
- princeton-nlp/gemma2-ultrafeedback-armorm · dataset · 73 dl
- Magpie-Align/Magpie-Air-DPO-100K-v0.1 · dataset · 67 dl
- Magpie-Align/Magpie-Pro-DPO-100K-v0.1 · dataset · 165 dl
- Magpie-Align/Magpie-Llama-3.1-Pro-DPO-100K-v0.1 · dataset · 60 dl
Taxonomy
Topics: Multi-Criteria Decision Making · Bayesian Modeling and Causal Inference
Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
