Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Chris Yuhao Liu; Liang Zeng; Yuzhen Xiao; Jujie He; Jiacai Liu; Chaojie Wang; Rui Yan; Wei Shen; Fuxiang Zhang; Jiacheng Xu; Yang Liu; Yahui Zhou

arXiv:2507.01352 · cs.CL · March 4, 2026

TL;DR

Skywork-Reward-V2 introduces a large-scale, high-quality preference dataset and a human-AI collaborative curation pipeline that together improve open reward models' performance across seven benchmarks.

Contribution

The paper presents SynPref-40M, a large preference dataset, and a human-AI curation pipeline, leading to the development of Skywork-Reward-V2, a versatile suite of reward models with state-of-the-art results.

Findings

01. Skywork-Reward-V2 outperforms existing reward models on seven benchmarks.
02. High-quality data curation enhances reward model effectiveness.
03. Human-AI synergy enables scalable, rigorous preference data collection.

Abstract

Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while LLMs perform automatic curation based on human guidance. Training on this preference mixture, we introduce…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The open contributions and deliverables include a new, large-scale, high-quality preference dataset, a valuable asset for the research community. The paper verifies that brittleness is a root cause of RM underperformance and proposes a solution: human-AI curation synergy, an elegant hybrid of human and LLM curation that balances quality and scalability. Empirically, the models (1.7B, 8B) achieve SOTA across seven benchmarks, outperforming much larger closed-source RMs.

Weaknesses

The conclusion of this paper is favorable. However, pairwise preferences are assumed to follow transitive rules, and their quality can be compromised by the intransitivity observed in human annotations. The paper below highlights the existence of such 'intransitivity':
- https://arxiv.org/abs/2409.19325 (Duan et al, 2017)
In a realistic world where 'intransitive' relationships accumulate, quality control of the curated dataset is critical, but was not clarified in the proposed pipel
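
To make the reviewer's concern concrete, the following is a minimal sketch of how intransitive triples could be detected in a set of pairwise preferences. It is an illustration only and is not part of the paper's pipeline; the function name `find_intransitive_triples` and the `(winner, loser)` data layout are assumptions made for this example.

```python
from itertools import permutations


def find_intransitive_triples(preferences):
    """Return cycles (a, b, c) such that a beats b, b beats c, and c beats a.

    `preferences` is an iterable of (winner, loser) pairs. This layout is a
    hypothetical convention for the sketch, not the dataset's actual schema.
    """
    wins = set(preferences)
    items = sorted({x for pair in wins for x in pair})
    cycles = set()
    for a, b, c in permutations(items, 3):
        if (a, b) in wins and (b, c) in wins and (c, a) in wins:
            # The same cycle appears under three rotations; keep one canonical form.
            cycles.add(min((a, b, c), (b, c, a), (c, a, b)))
    return sorted(cycles)


# A > B, B > C, C > A forms a cycle; A > D does not participate in one.
prefs = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
print(find_intransitive_triples(prefs))
```

This brute-force check is cubic in the number of distinct responses, so it suits small groups of responses to the same prompt rather than a full dataset scan.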

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper demonstrates a clear and significant performance improvement.
2. The release of a new series of top-performing RMs and the massive underlying dataset is a valuable contribution to the open-source ecosystem.

Weaknesses

The paper's primary weaknesses are twofold: (1) a pervasive lack of clarity and the omission of essential methodological details, and (2) as a result, it is difficult to determine the true source of the claimed performance improvements.

**Lack of Clarity and Omission of Essential Details**

The paper is extremely difficult to follow. The authors have clearly performed a massive amount of work, but the execution is not explained clearly, hindering reproducibility and full comprehension. Many ter

Reviewer 03 · Rating 6 · Confidence 3

Strengths

+ I assume that the dataset itself will be made available as promised. Obviously, a dataset at this scale and level of curation is a very strong contribution to the field
+ The reported benchmark scores are very impressive, clearly outperforming existing reward models
+ The evaluation of the trained reward models is thorough and very comprehensive
+ I really appreciate the provided ablation studies. Evaluating the trade-offs of data curation and dataset scaling is insightful. In particular, sect

Weaknesses

- The actual dataset generation/curation process is missing details, in particular: What is the composition of the dataset (i.e., which categories does it contain? How are the prompts+responses formulated? What is the origin of prompts + responses? Does it contain multi-lingual samples?) Also, there are just no samples or insights given about the contents of the collected dataset. I think maybe it would also be possible to provide some more details in the main paper, important information like a

Code & Models

Datasets

OpenDataArena/OpenDataArena-scored-data-2603
dataset · 217 dl

Taxonomy

Topics: Recommender Systems and Techniques · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Data Classification