Tool-Augmented Reward Modeling

Lei Li; Yekun Chai; Shuohuan Wang; Yu Sun; Hao Tian; Ningyu Zhang; Hua Wu

arXiv:2310.01045 · cs.CL · February 13, 2024 · 2 cites

TL;DR

This paper introduces Themis, a tool-augmented reward modeling approach that gives reward models access to external tools for arithmetic, code execution, and factual lookup, leading to improved alignment and performance.

Contribution

The paper presents a novel method for integrating external tools into reward models, improving their interpretability, reliability, and task performance in preference modeling and RLHF.

Findings

01 · 17.7% overall improvement in preference ranking across eight tasks

02 · Outperforms Gopher 280B by 7.3% on TruthfulQA in zero-shot evaluation

03 · RLHF with Themis achieves 32% win rate over baselines in human evaluations

Abstract

Reward modeling (a.k.a., preference modeling) is instrumental for aligning large language models with human preferences, particularly within the context of reinforcement learning from human feedback (RLHF). While conventional reward models (RMs) have exhibited remarkable scalability, they often struggle with fundamental functionality such as arithmetic computation, code execution, and factual lookup. In this paper, we propose a tool-augmented preference modeling approach, named Themis, to address these limitations by empowering RMs with access to external environments, including calculators and search engines. This approach not only fosters synergy between tool utilization and reward grading but also enhances interpretive capacity and scoring reliability. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources and construct…
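
As a rough illustration of the calculator case mentioned in the abstract, the sketch below checks a claimed arithmetic answer against a tool observation. It is a toy under stated assumptions, not the Themis implementation: the names `calculator` and `tool_checked_reward`, the tolerance parameter, and the binary 0/1 score are all hypothetical, whereas the paper's reward model produces a learned score.

```python
import ast
import operator

# Binary operators the toy calculator tool accepts (an illustrative subset).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> float:
    """Evaluate a simple arithmetic expression by walking its AST (no eval())."""

    def _eval(node):
        # Plain numeric literal (booleans are rejected on purpose).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        # Supported binary operation: evaluate both sides recursively.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        # Unary minus, e.g. "-3".
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


def tool_checked_reward(expression: str, claimed_answer: float, tol: float = 1e-6) -> float:
    """Hypothetical toy reward: 1.0 if the claim matches the tool observation, else 0.0."""
    observation = calculator(expression)
    return 1.0 if abs(observation - claimed_answer) <= tol else 0.0


if __name__ == "__main__":
    print(tool_checked_reward("12 * (3 + 4)", 84))  # claim agrees with the tool
    print(tool_checked_reward("12 * (3 + 4)", 80))  # claim disagrees with the tool
```

The point of the sketch is only that the score depends on an external observation rather than on the model's own arithmetic; how Themis combines observations with a learned score is not shown here.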

Peer Reviews

Decision · ICLR 2024 spotlight

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

(1) The paper addresses an important issue in reward modeling by introducing a tool-augmented approach to enhance the effectiveness of RMs.
(2) The proposed methodology of integrating external tools into RMs is innovative and practical, allowing for dynamic decision-making and reasoning processes.
(3) The experimental results demonstrate significant improvements in preference ranking and outperformance of Themis compared to baseline RMs, validating the effectiveness of the approach.

Weaknesses

The description of the method is not very clear. My understanding is that the reward model first generates some explanations based on the inputted question and answer, and then connects a fully connected layer to the final hidden state to produce a scalar reward.
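
The reviewer's reading of the method (a fully connected layer on the final hidden state producing a scalar reward) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the function names, the plain-list representation of the hidden state, and the pairwise ranking loss are assumptions. The loss shown is the standard preference-modeling objective, -log sigmoid(r_chosen - r_rejected), and is not taken from this page.

```python
import math


def scalar_reward(final_hidden_state, weights, bias=0.0):
    """Linear head: map the final hidden state vector to one scalar reward."""
    if len(final_hidden_state) != len(weights):
        raise ValueError("dimension mismatch")
    return sum(h * w for h, w in zip(final_hidden_state, weights)) + bias


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) as a numerically stable softplus."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = log(1 + exp(-margin)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical 2-dimensional hidden states and head weights, for illustration only.
    head_weights = [0.5, -0.25]
    r_good = scalar_reward([2.0, 0.0], head_weights)
    r_bad = scalar_reward([0.0, 2.0], head_weights)
    print(r_good, r_bad, pairwise_ranking_loss(r_good, r_bad))
```

The loss shrinks as the preferred response's reward exceeds the rejected one's by a wider margin, which is what trains the head to rank responses.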

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 3

Strengths

- Having a level of reasoning and interpretability is a great feature to have for reward models
- The experiments and provided implementation details look comprehensive

Weaknesses

1- How to trust tools is an important aspect to consider here. At least in the examples, it looks like there is a risk of biasing the reward model and generative model toward the outputs of the specific tools being used. This could be concerning, as tools are not necessarily unbiased.
2- It is not entirely clear how GPT-4 is used to generate RM training data. Note that GPT-4 itself is a system; if the proposal is to use GPT-4 to train RM, one can argue why not directly train RM on GPT-4 data or use GPT-4 dir…

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

**Strength 1**: The idea of augmenting reward models with tools is very interesting, novel, and timely.
**Strength 2**: The proposed method provides a nice and logical way for tools to be included in the reward design process.
**Strength 3**: This paper provides some interesting experiments such as application to RLHF and scaling experiments.

Weaknesses

**Weakness 1**: One of my main concerns is the lack of experiments on standard reward modeling datasets. There are many datasets not included in the paper, such as the Anthropic HH dataset, Stack Overflow, OpenAI WebGPT, and ChatGPT comparisons datasets. They do conduct analysis on a small portion of the HH dataset, but not on the provided testing set. In addition, they show worse test accuracy than is reported in some other papers that only use conventional reward modeling [1]. Since the main claim…

Code & Models

Repositories

ernie-research/Tool-Augmented-Reward-Model
PyTorch · Official

Models

🤗 ernie-research/Themis-7b
model · 4 dl · ♡ 4

Datasets

ernie-research/TARA
dataset · 21 dl

Videos

Tool-Augmented Reward Modeling · SlidesLive

Taxonomy

Topics: Business Process Modeling and Analysis · Corporate Governance and Management