Precision over Diversity: High-Precision Reward Generalizes to Robust Instruction Following

Yirong Zeng; Yufei Liu; Xiao Ding; Yutai Hou; Yuxian Wang; Haonan Song; Wu Ning; Dandan Tu; Qixun Zhang; Bibo Cai; Yuxiang He; Ting Liu

arXiv:2601.04954 · cs.LG · January 14, 2026

Open Access

TL;DR

This paper shows that, for instruction following in large language models, reward precision matters more than constraint diversity: training with high-precision rewards yields better performance and robustness, challenging the common emphasis on diverse training data.

Contribution

The study shows that training on hard-only constraints with high-precision rewards outperforms training on mixed datasets, and it proposes a data refinement strategy that improves efficiency and generalization in instruction following.

Findings

01

High-precision rewards improve model performance by 13.4%.

02

Training with hard-only constraints reduces training time by 58%.

03

Reward precision is more critical than constraint diversity for effective alignment.

Abstract

A central belief in scaling reinforcement learning with verifiable rewards for instruction following (IF) tasks is that a diverse mixture of verifiable hard constraints and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false responses, which leads to severe reward hacking and thereby undermines the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards develop a transferable meta-skill for IF.…
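To make the two quantities in the abstract concrete, the following is a minimal sketch of how a judge's recall on false responses and the resulting reward precision could be computed from labeled examples. The function names and the toy labels are hypothetical illustrations, not the paper's evaluation code or data; the definitions assume that a "false response" is one that violates its constraints and that the judge's acceptance is what gets rewarded.

```python
from typing import Sequence


def false_response_recall(truly_correct: Sequence[bool],
                          judge_accepts: Sequence[bool]) -> float:
    """Share of truly incorrect responses that the judge rejects.

    Assumed definition: a low value means many incorrect responses
    slip through and still receive a positive reward.
    """
    if len(truly_correct) != len(judge_accepts):
        raise ValueError("label sequences must have the same length")
    verdicts_on_incorrect = [
        accepted
        for correct, accepted in zip(truly_correct, judge_accepts)
        if not correct
    ]
    if not verdicts_on_incorrect:
        raise ValueError("no incorrect responses to evaluate")
    caught = sum(1 for accepted in verdicts_on_incorrect if not accepted)
    return caught / len(verdicts_on_incorrect)


def reward_precision(truly_correct: Sequence[bool],
                     judge_accepts: Sequence[bool]) -> float:
    """Share of rewarded (accepted) responses that are truly correct."""
    if len(truly_correct) != len(judge_accepts):
        raise ValueError("label sequences must have the same length")
    labels_of_accepted = [
        correct
        for correct, accepted in zip(truly_correct, judge_accepts)
        if accepted
    ]
    if not labels_of_accepted:
        raise ValueError("judge accepted no responses")
    return sum(labels_of_accepted) / len(labels_of_accepted)


# Toy labels, invented purely for illustration (not results from the paper).
truly_correct = [True, True, False, False, False, False]
judge_accepts = [True, True, True, True, True, False]

print(false_response_recall(truly_correct, judge_accepts))  # 0.25
print(reward_precision(truly_correct, judge_accepts))       # 0.4
```

In this toy case the judge rejects only one of four incorrect responses, so three of the five rewarded responses are wrong. That is the kind of gap a policy can exploit, which is how low recall on false responses translates into low reward precision.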

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics