Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou

TL;DR
Skywork-Reward-V2 introduces a large-scale, high-quality preference dataset and a human-AI collaborative pipeline to significantly improve open reward models' performance across various benchmarks.
Contribution
The paper presents SynPref-40M, a large preference dataset, and a human-AI curation pipeline, leading to the development of Skywork-Reward-V2, a versatile suite of reward models with state-of-the-art results.
Findings
Skywork-Reward-V2 outperforms existing reward models on seven benchmarks.
High-quality data curation enhances reward model effectiveness.
Human-AI synergy enables scalable, rigorous preference data collection.
Abstract
Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while LLMs perform automatic curation based on human guidance. Training on this preference mixture, we introduce…
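As a rough illustration of how a reward model can be trained and scored on preference pairs, the sketch below uses the standard Bradley-Terry pairwise objective. This is an assumption: the abstract excerpt above does not state which loss Skywork-Reward-V2 uses, and the function names `bradley_terry_loss` and `pairwise_accuracy` are hypothetical.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumes the Bradley-Terry model: P(chosen > rejected) = sigmoid(r_c - r_r).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(score_pairs) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    score_pairs = list(score_pairs)
    if not score_pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)
```

Under this assumption, the loss shrinks as the model scores the chosen response further above the rejected one, and pairwise accuracy is the kind of metric that preference benchmarks typically report.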
Peer Reviews
Decision·ICLR 2026 Poster
The open contributions and deliverables include a new, large-scale, high-quality preference dataset, a valuable asset for the research community. The paper verifies that this brittleness is a root cause of RM underperformance and proposes a solution: human-AI curation synergy, an elegant hybrid of human and LLM curation that balances quality and scalability. Empirically, the models (1.7B, 8B) achieve SOTA across seven benchmarks, outperforming much larger closed-source RMs.
The conclusion of this paper is favorable. However, pairwise preferences are assumed to follow transitive rules, and their quality can be compromised by the intransitivity observed in human annotations. The paper below highlights the existence of such 'intransitivity': - https://arxiv.org/abs/2409.19325 (Duan et al, 2017) In a realistic world where 'intransitive' relationships accumulate, quality control of the curated dataset is critical, but was not clarified in the proposed pipel
1. The paper demonstrates a clear and significant performance improvement.
2. The release of a new series of top-performing RMs and the massive underlying dataset is a valuable contribution to the open-source ecosystem.
The paper's primary weaknesses are twofold: (1) a pervasive lack of clarity and the omission of essential methodological details, and (2) as a result, it is difficult to determine the true source of the claimed performance improvements. **Lack of Clarity and Omission of Essential Details** The paper is extremely difficult to follow. The authors have clearly performed a massive amount of work, but the execution is not explained clearly, hindering reproducibility and full comprehension. Many ter
+ I assume that the dataset itself will be made available as promised. Obviously, a dataset at this scale and level of curation is a very strong contribution to the field
+ The reported benchmark scores are very impressive, clearly outperforming existing reward models
+ The evaluation of the trained reward models is thorough and very comprehensive
+ I really appreciate the provided ablation studies. Evaluating the trade-offs of data curation and dataset scaling is insightful. In particular, sect
- The actual dataset generation/curation process is missing details, in particular: What is the composition of the dataset (i.e., which categories does it contain? How are the prompts+responses formulated? What is the origin of prompts + responses? Does it contain multi-lingual samples?) Also, there are just no samples or insights given about the contents of the collected dataset. I think maybe it would also be possible to provide some more details in the main paper, important information like a
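The intransitivity concern raised in the reviews above can be made concrete with a small check. The sketch below is not part of the paper's pipeline. It assumes preference pairs for a single prompt are given as hypothetical `(winner, loser)` tuples and uses a depth-first search to find a cycle such as A > B > C > A.

```python
from collections import defaultdict


def find_preference_cycle(pairs):
    """Return one intransitive cycle as a list of items, or None if there is none.

    `pairs` is an iterable of (winner, loser) tuples (an assumed input format).
    """
    graph = defaultdict(list)
    for winner, loser in pairs:
        graph[winner].append(loser)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = defaultdict(int)
    path = []

    def dfs(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                # Back edge to a node on the current path closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None
```

A curation step could, for example, flag prompts where this function returns a cycle for human re-annotation. Whether SynPref-40M applies any such filter is not stated in the excerpts here.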
Code & Models
- 🤗 Skywork/Skywork-Reward-V2-Qwen3-8B · model · 8.8k dl · ♡ 23
- 🤗 Skywork/Skywork-Reward-V2-Llama-3.2-1B · model · 15k dl · ♡ 7
- 🤗 Skywork/Skywork-Reward-V2-Llama-3.1-8B · model · 17k dl · ♡ 41
- 🤗 Skywork/Skywork-Reward-V2-Llama-3.1-8B-40M · model · 2.5k dl · ♡ 23
- 🤗 Skywork/Skywork-Reward-V2-Qwen3-0.6B · model · 111k dl · ♡ 12
- 🤗 Skywork/Skywork-Reward-V2-Qwen3-1.7B · model · 5.0k dl · ♡ 7
- 🤗 Skywork/Skywork-Reward-V2-Qwen3-4B · model · 2.7k dl · ♡ 10
- 🤗 Skywork/Skywork-Reward-V2-Llama-3.2-3B · model · 711 dl · ♡ 4
- 🤗 DCAgent/Skywork-Reward-V2-Qwen3-0.6B · model · 55 dl
Taxonomy
Topics: Recommender Systems and Techniques · Mobile Crowdsensing and Crowdsourcing · Machine Learning and Data Classification
