Diversity-Enhanced Reasoning for Subjective Questions

Yumeng Wang; Zhiyuan Fan; Jiayu Liu; Jen-tse Huang; Yi R. Fung

arXiv:2507.20187 · cs.CL · March 3, 2026



TL;DR

This paper introduces MultiRole-R1, a training framework that improves reasoning models on subjective questions by incorporating perspective diversity and token-level diversity, with reported accuracy gains both in and out of domain.

Contribution

The paper proposes a diversity-enhanced training method that combines unsupervised data synthesis with reinforcement learning to improve subjective reasoning performance.

Findings

01. Increases in-domain accuracy by 14.1%

02. Improves out-of-domain accuracy by 7.64%

03. Enhances performance on advanced math reasoning tasks

Abstract

Large Reasoning Models (LRMs) with long chain-of-thought capabilities, optimized via reinforcement learning with verifiable rewards (RLVR), excel at objective reasoning tasks such as mathematical problem solving and code generation. However, RLVR is known to degrade generation diversity, which causes LRMs to fall short on subjective reasoning, where multiple answers are valid depending on the role perspective. While recent studies recognize the importance of diversity-enhanced training in objective reasoning, limited attention has been given to subjective tasks. In this paper, we find that subjective reasoning can be improved by introducing perspective diversity and token-level diversity: the former provides a coherent scaffolding anchored to a real-world stakeholder group, and the latter broadens the answer search space. We propose MultiRole-R1, a diversity-enhanced…
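To make "token-level diversity" concrete, here is a minimal sketch of one common proxy, the distinct-n ratio (unique n-grams divided by total n-grams across a set of sampled responses). This is an illustrative assumption, not the measure used in the paper: the function name `distinct_n`, the whitespace tokenization, and the example responses are all hypothetical.

```python
from typing import List, Sequence, Tuple


def distinct_n(samples: Sequence[str], n: int = 2) -> float:
    """Return unique n-grams / total n-grams over all samples.

    Illustrative proxy for token-level diversity; NOT the paper's metric.
    Tokenization is plain lowercase whitespace splitting (an assumption).
    """
    ngrams: List[Tuple[str, ...]] = []
    for text in samples:
        tokens = text.lower().split()
        # Collect every contiguous window of n tokens from this sample.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0  # No n-grams at all (empty input or samples shorter than n).
    return len(set(ngrams)) / len(ngrams)


# Hypothetical responses to a subjective question.
repetitive = ["the tax is fair", "the tax is fair"]
varied = ["the tax is fair", "renters would pay more"]

print(distinct_n(repetitive))  # 0.5: each bigram appears twice
print(distinct_n(varied))      # 1.0: no bigram repeats
```

A ratio near 1.0 means the sampled responses rarely repeat phrasing, while a low ratio signals the kind of collapsed generation diversity the abstract attributes to RLVR. Surface n-gram overlap says nothing about whether the underlying perspectives differ, so it could only complement, not replace, a perspective-level check.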

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Novel problem focus. The paper addresses an underexplored yet important gap, reasoning diversity in subjective questions, where existing RLVR methods optimized for objective correctness tend to fail.
2. Insightful findings. The observation that diversity correlates more strongly with accuracy than reasoning length offers a new perspective on how reasoning quality may scale, potentially influencing future RL-for-reasoning research.
3. Strong empirical results. The model achieves large gains on m

Weaknesses

1. Heuristic role synthesis. The generation of role perspectives is heuristic and lacks quantitative validation to ensure that the synthesized roles truly represent distinct or complementary viewpoints rather than superficial differences.
2. Insufficient ablation and qualitative analysis. The paper lacks fine-grained ablation to disentangle the contributions of individual diversity components, and provides limited qualitative evidence that multi-role reasoning genuinely captures diverse perspect

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper addresses an under-explored problem: how to enhance reasoning diversity for subjective questions.
- The proposed framework is conceptually clear and builds on recognizable methods (role-based prompting and RLVR).
- The experiments cover several datasets (BBQ, GLOQA, ETHICS, CALI, CSQA, GSM8K, AIME-2024) and include multiple backbone models.
- The analysis section connects diversity and accuracy correlations in an interpretable way.

Weaknesses

1. Marginal quantitative improvements. Despite an elaborate pipeline, the reported gains over strong baselines such as GRPO or “More-Think” are quite small (often within 1–2%), and sometimes inconsistent across datasets (Table 1). The paper frames these as large improvements, but the absolute differences do not seem practically significant, especially on subjective tasks where evaluation itself is noisy.
2. Limited novelty in the algorithmic contribution. The proposed method mainly combines kno

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The research problem this paper focuses on is critical to the reasoning community.
2. The paper is easy to follow and well-written.
3. Experimental results are solid, and the generalization ability in the math domain is also important.

Weaknesses

1. The hyperparameter tuning is problematic. Based on the experimental results in Table 1, the hyperparameter is directly tuned on the test dataset of GLOQA, which is a data leakage problem and harms the reliability of the result. The hyperparameter needs to be tuned on a separate validation set; otherwise, the result is misleading.
2. The paper’s claim of “diversity of perspectives” remains unsubstantiated. The framework depends on the characters generated in stage one, embodying genuinely dive


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Intelligent Tutoring Systems and Adaptive Learning