Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts

Haoxiang Wang; Wei Xiong; Tengyang Xie; Han Zhao; Tong Zhang

arXiv:2406.12845·cs.LG·June 19, 2024

TL;DR

This paper introduces an interpretable reward model for language models using multi-objective absolute ratings and a mixture-of-experts approach, improving alignment and transparency in RLHF.

Contribution

The paper proposes a two-stage method that combines multi-objective absolute ratings with a mixture-of-experts gating network to make reward modeling interpretable.
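To make the idea of combining per-objective ratings with a gating network concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the objective names, the linear gate, the softmax weighting, the feature vector, and all numeric values are hypothetical and do not come from the paper summary above.

```python
import math

# Hypothetical objective names; the listing above does not name the objectives.
OBJECTIVES = ["helpfulness", "correctness", "verbosity"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gating_weights(prompt_features, gate_matrix):
    """Assumed gate: one linear logit per objective, then softmax.

    gate_matrix has one row per objective; each row is dotted with the
    prompt features to produce that objective's logit.
    """
    logits = [
        sum(w * x for w, x in zip(row, prompt_features))
        for row in gate_matrix
    ]
    return softmax(logits)


def scalar_reward(objective_ratings, weights):
    """Weighted sum of absolute per-objective ratings."""
    return sum(w * r for w, r in zip(weights, objective_ratings))


# Toy inputs (all values are made up for illustration).
gate_matrix = [[0.5, -0.2], [0.1, 0.9], [-0.3, 0.0]]
prompt_features = [1.0, 2.0]
ratings = [0.8, 0.6, 0.3]  # stage-one style absolute ratings, one per objective

weights = gating_weights(prompt_features, gate_matrix)
reward = scalar_reward(ratings, weights)

# The per-objective breakdown is what makes the final score inspectable.
for name, w, r in zip(OBJECTIVES, weights, ratings):
    print(f"{name}: weight={w:.3f} rating={r:.2f}")
print(f"scalar reward = {reward:.3f}")
```

Because the weights are non-negative and sum to one in this sketch, the scalar reward is a convex combination of the per-objective ratings, so a reader can see which objective drove the score for a given prompt.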

Findings

01 Achieved state-of-the-art performance on RewardBench.

02 Surpassed GPT-4 judges in reward evaluation accuracy.

03 Approached the performance of larger reward models with fewer parameters.

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability, as humans cannot intuitively understand why an RM thinks a response is good or not. As RMs act as human preference proxies, we believe they should be human-interpretable to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage…

Taxonomy

Topics: Multi-Criteria Decision Making · Bayesian Modeling and Causal Inference

Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer