TL;DR
This paper introduces a new training paradigm and metric to enhance the diversity of reasoning paths in large language models, leading to improved reasoning performance and output variety.
Contribution
It proposes the 1PNS training paradigm and Reasoning Path Divergence metric to increase inference diversity and improve reasoning accuracy in LLMs.
Findings
RPD-selected training increases output diversity
Achieves +2.80% pass@16 improvement over baseline
Enhances reasoning performance on AIME24 dataset
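For readers unfamiliar with the pass@16 figure above, the sketch below shows how pass@k is commonly estimated. It uses the widely adopted unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them are correct. Whether the paper uses exactly this estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that were correct, k: budget being scored.
    Assumes the standard estimator 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = k = 16, the estimate for a problem is simply 1 if any of the 16 samples is correct and 0 otherwise, which is why output diversity matters for this metric: repeated near-identical samples add little chance of a new correct answer.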
Abstract
While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. This homogenization not only limits sampling effectiveness but also restricts the exploration space for subsequent Reinforcement Learning (RL) stages. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that…
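To make the idea of a step-level, asymmetric divergence concrete, here is a minimal sketch. It is not the paper's definition of RPD. It assumes each solution has already been reduced to one embedding vector per reasoning step, and it scores solution A against solution B by matching every step of A to its closest step in B under cosine distance. The function names and the toy vectors are hypothetical.

```python
import numpy as np


def cosine_distance_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distances between rows of a (m x d) and b (n x d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def directed_step_divergence(steps_a: np.ndarray, steps_b: np.ndarray) -> float:
    """Mean, over steps of A, of the distance to the nearest step of B.

    Illustrative only: the score is directed, so swapping A and B can change it.
    """
    distances = cosine_distance_matrix(steps_a, steps_b)
    return float(distances.min(axis=1).mean())


# Toy step embeddings (assumed; in practice these would come from an
# embedding model applied to step summaries).
solution_a = np.array([[1.0, 0.0]])              # one coarse step
solution_b = np.array([[1.0, 0.0], [0.0, 1.0]])  # same step plus an extra one

a_to_b = directed_step_divergence(solution_a, solution_b)
b_to_a = directed_step_divergence(solution_b, solution_a)
```

In this toy case every step of A has an exact match in B, so `a_to_b` is 0, while B contains a step that A lacks, so `b_to_a` is 0.5. A directed score of this kind is one way to avoid penalizing a solution merely for being summarized at a coarser granularity.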
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The metric design is clear and reasonable. RPD introduces a fine-grained, asymmetric step-level comparison that captures strategic rather than superficial differences.
- Consistent improvements across AIME24, MATH500, and OlympiadBench, with robust ablations on number of solutions, problem selection, and temperature scaling.
- Evaluation limited to math reasoning: The study focuses exclusively on quantitative tasks (AIME, MATH, Olympiad); generalization to open-ended or commonsense reasoning remains unclear.
- The curation pipeline (step summarization, embedding, and pairwise distance computation) may be costly for larger datasets.
### 1. Novel and Well-Motivated Metric (RPD)
RPD is a creative and principled approach to measuring semantic diversity at the step level, addressing a key limitation of embedding-based methods that conflate surface-level differences with strategic divergence. The asymmetric design is particularly insightful for handling summarization granularity.
### 2. Thorough Experimental Design
The paper includes extensive ablations, scalability tests, and diversity analyses. The authors also validate
### 1. Limited Generalization Beyond Math
All experiments are conducted on math reasoning tasks (AIME24, MATH500, OlympiadBench). While the gains are convincing, it is unclear whether RPD and 1PNS generalize to other reasoning domains (e.g., logic, science, coding), limiting the broader impact of the work.
### 2. Scalability and Compute Overhead
RPD relies on LLM-based summarization and embedding computation for every solution pair, which is compute-intensive and may not scale well to larger
* clear writing
* identifies an important problem
* method description + experiment execution is sound
* I don't think diversity is a serious issue for math and code reasoning problems. It tends to be a problem for more subjective tasks. The problem domain selected by the authors seems contrived, i.e., since we have readily available benchmarks and datasets in math, let's do math.
* There should be a temperature scaling for the majority vote. It's unclear why the authors stop at T=1 (Table 6). Also, even given the results in Table 6, it's clear that the marginal benefit of RPD diminishes as temperat
