Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

Feng Ju; Zeyu Qin; Rui Min; Zhitao He; Lingpeng Kong; Yi R. Fung

arXiv:2510.26122·cs.CL·January 6, 2026

3 Reviews

TL;DR

This paper introduces a new training paradigm and metric to enhance the diversity of reasoning paths in large language models, leading to improved reasoning performance and output variety.

Contribution

It proposes the 1PNS training paradigm and Reasoning Path Divergence metric to increase inference diversity and improve reasoning accuracy in LLMs.

Findings

01

RPD-selected training increases output diversity

02

Achieves +2.80% pass@16 improvement over baseline

03

Enhances reasoning performance on AIME24 dataset
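For readers unfamiliar with the pass@16 figure above, the sketch below shows the commonly used unbiased pass@k estimator: given n sampled solutions of which c are correct, it estimates the probability that at least one of k samples is correct. This summary does not state which evaluation protocol the paper used, so treat this as background on the metric, not a description of the authors' evaluation code. The function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples drawn for one problem
    c: number of those samples that are correct
    k: sampling budget being evaluated (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative values only, not results from the paper.
print(pass_at_k(n=2, c=1, k=1))    # one of two samples correct, budget 1
print(pass_at_k(n=16, c=0, k=16))  # no correct samples at all
```

The benchmark-level score is the mean of this per-problem value over all problems, which is why a more diverse sampler can raise pass@k even when single-sample accuracy is unchanged.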

Abstract

While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. This homogenization not only limits sampling effectiveness but also restricts the exploration space for subsequent Reinforcement Learning (RL) stages. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that…
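The abstract is cut off before it defines RPD, and the exact formula is not reproduced on this page. The reviews below describe it as an asymmetric, step-level comparison built from step summarization, embedding, and pairwise distance computation. The following is a minimal sketch of one way such a score could look, assuming each solution has already been reduced to a list of step embeddings. The function names, the use of cosine distance, and the nearest-step averaging rule are all assumptions made for illustration, not the paper's definition.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def directed_divergence(steps_a: List[Vector], steps_b: List[Vector]) -> float:
    """Hypothetical asymmetric score: for every step of A, take the distance
    to its closest step in B, then average over the steps of A."""
    if not steps_a or not steps_b:
        raise ValueError("both solutions need at least one step")
    nearest = [min(cosine_distance(a, b) for b in steps_b) for a in steps_a]
    return sum(nearest) / len(nearest)


# Toy embeddings (assumed, 2-D for readability).
solution_a = [[1.0, 0.0], [0.0, 1.0]]  # two distinct steps
solution_b = [[1.0, 0.0]]              # shares only the first step

print(directed_divergence(solution_a, solution_b))  # 0.5
print(directed_divergence(solution_b, solution_a))  # 0.0
```

The two directions differ because every step of the shorter solution is matched in the longer one, but not the reverse. Scoring every ordered pair among m solutions to a problem takes m(m - 1) such calls, on top of the summarization and embedding passes, which is the overhead the reviewers raise.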

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The metric design is clear and reasonable. RPD introduces a fine-grained, asymmetric step-level comparison that captures strategic rather than superficial differences.
- Consistent improvements across AIME24, MATH500, and OlympiadBench, with robust ablations on number of solutions, problem selection, and temperature scaling.

Weaknesses

- Evaluation limited to math reasoning: The study focuses exclusively on quantitative tasks (AIME, MATH, Olympiad); generalization to open-ended or commonsense reasoning remains unclear.
- The curation pipeline (step summarization, embedding, and pairwise distance computation) may be costly for larger datasets.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

### 1. Novel and Well-Motivated Metric (RPD)

RPD is a creative and principled approach to measuring semantic diversity at the step level, addressing a key limitation of embedding-based methods that conflate surface-level differences with strategic divergence. The asymmetric design is particularly insightful for handling summarization granularity.

### 2. Thorough Experimental Design

The paper includes extensive ablations, scalability tests, and diversity analyses. The authors also validate

Weaknesses

### 1. Limited Generalization Beyond Math

All experiments are conducted on math reasoning tasks (AIME24, MATH500, Olympiad Bench). While the gains are convincing, it is unclear whether RPD and 1PNS generalize to other reasoning domains (e.g., logic, science, coding), limiting the broader impact of the work.

### 2. Scalability and Compute Overhead

RPD relies on LLM-based summarization and embedding computation for every solution pair, which is compute-intensive and may not scale well to larger

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* clear writing
* identifies an important problem
* method description + experiment execution is sound

Weaknesses

* I don't think diversity is a serious issue for math and code reasoning problems. It tends to be a problem for more subjective tasks. The problem domain selected by the authors seems contrived, i.e. since we have readily available benchmarks and datasets in math, let's do math.
* There should be a temperature scaling for the majority vote. It's unclear why the authors stop at T=1 (Table 6). Also, even given the results in Table 6, it's clear that the marginal benefit of RPD diminishes as temperat
