RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment

Zhuoran Jin; Hongbang Yuan; Tianyi Men; Pengfei Cao; Yubo Chen; Kang Liu; Jun Zhao

arXiv:2412.13746·cs.CL·December 19, 2024

TL;DR

RAG-RewardBench is a new benchmark for evaluating reward models (RMs) in retrieval augmented generation (RAG). It targets the challenges of preference alignment and is intended to guide future improvements in retrieval augmented language models (RALMs).

Contribution

The paper introduces the first comprehensive benchmark for assessing reward models in RAG settings, covering diverse scenarios and using an efficient LLM-based annotation method.

Findings

01

Existing RALMs show minimal improvement in preference alignment.

02

The benchmark reveals limitations of current reward models in RAG scenarios.

03

The LLM-as-judge approach correlates well with human annotations.
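
As an illustration of what "correlates well" could mean in practice, the sketch below computes a Pearson correlation between judge scores and human scores for the same items. The page does not say which correlation measure the authors use, so Pearson is an assumption here, and the function name `pearson` and its inputs are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists.

    Assumption: xs holds LLM-judge scores and ys holds human scores
    for the same items; the source does not specify this layout.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would indicate that the judge ranks items much as the human annotators do; a value near 0 would indicate little linear agreement.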

Abstract

Despite the significant progress made by existing retrieval augmented language models (RALMs) in providing trustworthy responses and grounding in reliable sources, they often overlook effective alignment with human preferences. In the alignment process, reward models (RMs) act as a crucial proxy for human values to guide optimization. However, it remains unclear how to evaluate and select a reliable RM for preference alignment in RALMs. To this end, we propose RAG-RewardBench, the first benchmark for evaluating RMs in RAG settings. First, we design four crucial and challenging RAG-specific scenarios to assess RMs, including multi-hop reasoning, fine-grained citation, appropriate abstain, and conflict robustness. Then, we incorporate 18 RAG subsets, six retrievers, and 24 RALMs to increase the diversity of data sources. Finally, we adopt an LLM-as-a-judge approach to improve preference…
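
A minimal sketch of how a reward model might be scored on a preference benchmark of this kind. It assumes each example is a (prompt, chosen, rejected) triple and that the metric is pairwise accuracy, which is common for reward-model benchmarks but is not stated in the text above. The names `pairwise_accuracy` and `toy_score`, and the record layout, are hypothetical.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical record layout: (prompt including retrieved context,
# preferred response, dispreferred response).
PreferencePair = Tuple[str, str, str]


def pairwise_accuracy(
    pairs: Iterable[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the scorer ranks the chosen response higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no preference pairs given")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)


def toy_score(prompt: str, response: str) -> float:
    """Placeholder scorer, not a real reward model.

    It counts '[' characters as a crude stand-in for citation markers.
    """
    return float(response.count("["))
```

In a real evaluation, `toy_score` would be replaced by a call to the reward model under test, and the accuracy could be reported separately for each scenario named in the abstract.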

Code & Models

Repositories

jinzhuoran/rag-rewardbench
Official

Datasets

jinzhuoran/RAG-RewardBench
dataset · 474 dl

Taxonomy

Topics: Recommender Systems and Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Residual Connection · Adam · Layer Normalization · Weight Decay · Softmax · WordPiece