Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

Zhaowei Zhang; Xiaohan Liu; Xuekai Zhu; Junchao Huang; Ceyao Zhang; Zhiyuan Feng; Yaodong Yang; Xiaoyuan Yi; Xing Xie

arXiv:2603.10588 · cs.AI · March 12, 2026

Open Access

TL;DR

This study empirically compares distribution-matching and reward-maximizing RLVR methods for LLM alignment in moral reasoning, finding that standard reward-maximizing approaches are sufficiently effective without requiring diversity-preserving algorithms.

Contribution

The paper provides the first comprehensive empirical comparison of RLVR paradigms on moral reasoning tasks, challenging the assumption that diversity-seeking methods are necessary for alignment.

Findings

01 Reward-maximizing methods perform as well as or better than distribution-matching approaches.

02 Moral reasoning responses are more concentrated in semantic space than mathematical reasoning responses.

03 Diversity-preserving algorithms are not inherently necessary for effective LLM alignment.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not show the significant advantages over reward-maximizing methods that we expected on alignment tasks. Through semantic visualization mapping…

Taxonomy

Topics: Ethics and Social Impacts of AI · Topic Modeling · Explainable Artificial Intelligence (XAI)