Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models

Chenglong Wang; Yifu Huo; Yang Gan; Yongyu Mu; Qiaozhi He; Murun Yang; Bei Li; Chunliang Zhang; Tongran Liu; Anxiang Ma; Zhengtao Yu; Jingbo Zhu; Tong Xiao

arXiv:2511.12464·cs.CL·November 18, 2025

TL;DR

This paper introduces a multi-dimensional evaluation framework and analysis method for reward models, improving interpretability and alignment by probing preference representations across different preference dimensions.

Contribution

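The paper presents MRMBench, a benchmark of six probing tasks covering different preference dimensions, and inference-time probing, an analysis method that identifies which dimensions a reward model uses during reward prediction and so makes that prediction more interpretable.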

Findings

01. MRMBench correlates with LLM alignment performance

02. Reward models often struggle with multi-dimensional preferences

03. Inference-time probing improves reward prediction confidence

Abstract

Previous methods evaluate reward models by testing them on a fixed pairwise ranking test set, but they typically do not provide performance information on each preference dimension. In this work, we address the evaluation challenge of reward models by probing preference representations. To confirm the effectiveness of this evaluation method, we construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks for different preference dimensions. We design it to favor and encourage reward models that better capture preferences across different dimensions. Furthermore, we introduce an analysis method, inference-time probing, which identifies the dimensions used during the reward prediction and enhances its interpretability. Through extensive experiments, we find that MRMBench strongly correlates with the alignment performance of large language models…
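
The abstract describes evaluating a reward model by probing its preference representations, one dimension at a time. This page does not give the paper's exact probing setup, so the following is only a minimal sketch of one common form of probing: freeze the model, take one representation vector per input, and fit a small linear classifier to predict a label for a single preference dimension. The synthetic vectors (standing in for reward-model hidden states), the binary dimension label, and every function name here are hypothetical and are not taken from MRMBench.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_representations(n, dim, seed):
    """Hypothetical stand-in for frozen reward-model representations.

    Each vector is Gaussian noise shifted by +/-0.75 per coordinate,
    depending on a binary label for one preference dimension.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 0.75 if y == 1 else -0.75
        features.append([rng.gauss(0.0, 1.0) + shift for _ in range(dim)])
        labels.append(y)
    return features, labels


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for x, y in zip(features, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((1 if z > 0 else 0) == y)
    return correct / len(labels)


# Split synthetic data into probe-training and held-out portions.
all_x, all_y = make_synthetic_representations(n=400, dim=8, seed=0)
train_x, train_y = all_x[:300], all_y[:300]
test_x, test_y = all_x[300:], all_y[300:]

weights, bias = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(weights, bias, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.3f}")
```

In this sketch, held-out probe accuracy plays the role of a per-dimension score: a higher value suggests the frozen representation encodes that dimension more linearly. Repeating the fit with a different label set per dimension would give one score per dimension.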

Code & Models

Datasets

ifnoc/MRMBench
dataset · 8 downloads

Videos

Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models· underline

Taxonomy

Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks