ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Timo Kaufmann; Yannick Metz; Daniel Keim; Eyke Hüllermeier

arXiv:2512.25023·cs.LG·January 1, 2026

TL;DR

ResponseRank introduces a method to learn preference strength from noisy proxy signals, improving data efficiency and robustness in preference modeling for reinforcement learning and language tasks.

Contribution

It presents a novel approach that leverages local relative strength signals for robust preference strength learning, with empirical validation across multiple domains.

Findings

01 Enhanced sample efficiency in preference learning

02 Improved robustness to noisy signals

03 Effective across diverse tasks including language and RL

Abstract

Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with…
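The stratified ranking idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions made only for illustration: each annotator forms one stratum, response time is the proxy signal, and a shorter response time is taken to indicate a stronger preference. The names `Comparison` and `rank_within_strata` are hypothetical, and how such rankings would enter a training objective is not covered here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    comparison_id: str
    annotator: str        # hypothetical stratum key (assumption)
    response_time: float  # seconds; hypothetical strength proxy (assumption)


def rank_within_strata(comparisons):
    """Group comparisons by stratum, then order each group from the
    strongest to the weakest inferred preference.

    Assumption: a shorter response time means a stronger preference.
    Signals are only compared inside a stratum, never across strata.
    """
    strata = defaultdict(list)
    for c in comparisons:
        strata[c.annotator].append(c)
    return {
        key: [c.comparison_id for c in sorted(group, key=lambda c: c.response_time)]
        for key, group in strata.items()
    }


data = [
    Comparison("c1", "ann_a", 2.0),
    Comparison("c2", "ann_a", 9.5),
    Comparison("c3", "ann_b", 30.0),
    Comparison("c4", "ann_b", 12.0),
    Comparison("c5", "ann_a", 4.1),
]
rankings = rank_within_strata(data)
print(rankings)
```

In this toy data, `c4` (12 s) ranks first within `ann_b` even though it is slower than every comparison from `ann_a`. Because only relative differences inside a stratum are used, a systematically slow annotator does not appear to hold uniformly weak preferences.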

Peer Reviews

Decision·NeurIPS 2025 poster

Reviewer 01 · Rating 3 · Confidence 3

Strengths

1. The authors use annotation time intervals as an indicator of preference strength, which allows incorporating preference degrees into the scoring information and should theoretically yield better results.
2. Meanwhile, the authors focus on addressing the lack of metrics for evaluating how well preference models capture preference degrees, introducing a measure based on the correlation coefficient between true scores and trained scores.

Weakness

1. Lacks training and testing resul

Videos

ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning · SlidesLive

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Emotion and Mood Recognition · Explainable Artificial Intelligence (XAI)