HelpSteer2: Open-source dataset for training top-performing reward models
Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev

TL;DR
HelpSteer2 is a new open-source, high-quality preference dataset for training reward models that better align large language models, achieving state-of-the-art performance with less data than existing datasets.
Contribution
We introduce HelpSteer2, a permissively licensed preference dataset that improves reward model training efficiency and effectiveness, leading to superior LLM alignment results.
Findings
Achieved a state-of-the-art score of 92.0% on Reward-Bench's primary dataset
Effective reward models trained with only 10,000 response pairs
Demonstrated improved LLM alignment with HelpSteer2-based reward models
Abstract
High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer, need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 are subject to restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful internal base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming currently listed open and proprietary models, as of June…
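The abstract refers to responses with attribute labels and to reward models trained on response pairs. As a rough illustration only, and not the authors' pipeline, the sketch below shows one way two rated responses to the same prompt could be turned into a chosen/rejected pair. The `RatedResponse` schema, the `helpfulness` field, and the `to_preference_pair` helper are all hypothetical names introduced for this example; the dataset's actual fields and rating scale are not given in this summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RatedResponse:
    """One response to a prompt plus a single attribute score (hypothetical schema)."""
    prompt: str
    response: str
    helpfulness: int  # assumed integer rating; higher is better


def to_preference_pair(a: RatedResponse, b: RatedResponse) -> Optional[dict]:
    """Build a chosen/rejected pair from two rated responses to the same prompt.

    Returns None when the ratings tie, since a tie carries no preference signal.
    """
    if a.prompt != b.prompt:
        raise ValueError("responses must answer the same prompt")
    if a.helpfulness == b.helpfulness:
        return None
    chosen, rejected = (a, b) if a.helpfulness > b.helpfulness else (b, a)
    return {
        "prompt": a.prompt,
        "chosen": chosen.response,
        "rejected": rejected.response,
        # Rating gap between the two responses; one possible confidence signal.
        "margin": chosen.helpfulness - rejected.helpfulness,
    }


# Toy data, invented for illustration; not taken from HelpSteer2.
pair = to_preference_pair(
    RatedResponse("What is 2 + 2?", "4", helpfulness=4),
    RatedResponse("What is 2 + 2?", "5", helpfulness=0),
)
tie = to_preference_pair(
    RatedResponse("Say hi.", "Hi!", helpfulness=3),
    RatedResponse("Say hi.", "Hello!", helpfulness=3),
)
```

Dropping tied pairs is a design choice of this sketch: a pairwise objective has nothing to learn from them, whereas a model that regresses directly on attribute scores could still use both responses.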
Code & Models
- 🤗 nvidia/Nemotron-4-340B-Instruct · model · 3.2k dl · ♡ 694
- 🤗 nvidia/Llama3-70B-SteerLM-RM · model · 11 dl · ♡ 43
- 🤗 nvidia/Llama3-70B-PPO-Chat · model · 4 dl
- 🤗 nvidia/Llama3-70B-SteerLM-Chat · model · 4 dl · ♡ 5
- 🤗 nvidia/Llama3-70B-DPO-Chat · model · 3 dl · ♡ 3
- 🤗 nvidia/Nemotron-4-340B-Reward · model · 37 dl · ♡ 126
- 🤗 failspy/Nemotron-4-340B-Instruct-SafeTensors · model · 7 dl · ♡ 22
- 🤗 mgoin/Nemotron-4-340B-Instruct-vllm · model · 49 dl
- 🤗 mgoin/Nemotron-4-340B-Instruct-FP8-Dynamic · model · 7 dl
- 🤗 nvidia/Llama-3.1-Nemotron-70B-Reward · model · 62 dl · ♡ 81
Taxonomy
Topics: Machine Learning in Healthcare
Methods: Residual Connection · Softmax · Balanced Selection · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention
