Learning from Peers in Reasoning Models

Tongxu Luo; Wenyu Du; Jiaxi Bi; Stephen Chung; Zhengyang Tang; Hao Yang; Min Zhang; Benyou Wang

arXiv:2505.07787·cs.CL·May 13, 2025

TL;DR

This paper introduces LeaP, a peer-learning approach for large reasoning models that improves their self-correction ability by sharing intermediate reasoning insights among multiple paths, leading to significant performance gains.

Contribution

We propose LeaP, a novel peer interaction method that enhances reasoning models' self-correction, and develop LeaP-T, fine-tuned models that outperform existing baselines on multiple benchmarks.

Findings

01 · LeaP improves accuracy by nearly 5 points on average.

02 · LeaP surpasses larger models like DeepSeek-R1-671B on math benchmarks.

03 · Fine-tuned LeaP-T-7B matches larger models' performance on AIME 2024.

Abstract

Large Reasoning Models (LRMs) have the ability to self-correct even when they make mistakes in their reasoning paths. However, our study reveals that when the reasoning process starts with a short but poor beginning, it becomes difficult for the model to recover. We refer to this phenomenon as the "Prefix Dominance Trap". Inspired by psychological findings that peer interaction can promote self-correction without negatively impacting already accurate individuals, we propose **Learning from Peers** (LeaP) to address this phenomenon. Specifically, every T tokens, each reasoning path summarizes its intermediate reasoning and shares it with others through a routing mechanism, enabling paths to incorporate peer insights during inference. However, we observe that smaller models sometimes fail to follow summarization and reflection instructions effectively. To address this, we fine-tune them…
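The loop described in the abstract (generate for a fixed token interval, summarize, route summaries to peers, continue) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: `leap_decode`, `generate`, `summarize`, `route`, and the `<peer>` marker are all hypothetical names, the stub functions stand in for real model calls, and the choice not to count peer messages against the token budget is an assumption.

```python
from typing import Callable, List

Path = List[str]  # a reasoning path, represented as a list of tokens


def leap_decode(
    generate: Callable[[int, Path, int], Path],
    summarize: Callable[[Path], str],
    route: Callable[[int, List[str]], List[str]],
    num_paths: int,
    interval: int,
    max_tokens: int,
) -> List[Path]:
    """Run parallel paths that exchange summaries every `interval` tokens."""
    paths: List[Path] = [[] for _ in range(num_paths)]
    generated = 0
    while generated < max_tokens:
        step = min(interval, max_tokens - generated)
        # 1. Every path continues independently for `step` tokens.
        for i in range(num_paths):
            paths[i] = paths[i] + generate(i, paths[i], step)
        generated += step
        if generated >= max_tokens:
            break
        # 2. Every path summarizes its reasoning so far.
        summaries = [summarize(p) for p in paths]
        # 3. Routing picks which peer summaries each path receives
        #    (assumption: peer messages do not count toward max_tokens).
        for i in range(num_paths):
            for s in route(i, summaries):
                paths[i] = paths[i] + [f"<peer>{s}</peer>"]
    return paths


# Hypothetical stubs standing in for model calls.
def toy_generate(i: int, path: Path, n: int) -> Path:
    return [f"p{i}"] * n  # path i just emits its own marker token


def toy_summarize(path: Path) -> str:
    return path[-1]  # "summary" = the most recent token


def toy_route(i: int, summaries: List[str]) -> List[str]:
    return [s for j, s in enumerate(summaries) if j != i]  # all peers


paths = leap_decode(toy_generate, toy_summarize, toy_route,
                    num_paths=3, interval=2, max_tokens=4)
```

With three paths, an interval of 2, and a budget of 4 tokens, each path generates two tokens, receives two peer summaries, and then generates two more tokens. A real routing function would presumably select a subset of peers rather than forwarding all of them.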

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 2

Strengths

Some of the key strengths:
1. The paper points out a widespread phenomenon: the first tokens (prefixes) can strongly influence the results, up to completely breaking a decoded sentence.
2. The paper introduces a test-time-only modification (with multiple agents) and studies the setup under multiple models and benchmarks (AIME 2024/2025, AIMO 2025, GPQA, ZebraLogic).

Weaknesses

As for weaknesses, I would call out:
1. Limited novelty; the method is very similar to ensembling methods. Due to the prompting, it seems rather novel; however, a single naive baseline with multiple reasoning traces and vote aggregation would probably clarify the contribution better.
2. A compute comparison would be necessary, given that multiple reasoning chains easily consume a lot of inference cost. The fine-tuning comparison does not strongly make the case either; reducing a 14B to 7B model is not particula

Reviewer 02 · Rating 2 · Confidence 4

Strengths

* LeaP introduces a simple method for learning from multiple reasoning traces. Experiments show that this can outperform majority voting.
* The authors conduct some analysis on how often the summarization module should be invoked and how many summaries should be aggregated.

Weaknesses

* The biggest weakness of this paper is its lack of novelty. Reasoning over multiple samples (or meta-reasoning) is an old concept (e.g., see [1], [2]) and here the only difference seems to be the summarizer module. I also don't see a clear ablation of this summarizer module to understand its usefulness. I also think that the method is not necessarily a multi-agent method because the experiments are limited to trajectories from the same underlying model. Related work also should not be in the a

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- The topic is timely and relevant, aligning with current research trends in parallel reasoning.
- The paper is clearly written and easy to follow.
- The core idea, allowing reasoning paths to communicate and cross-correct, is intuitive and conceptually appealing.

Weaknesses

I don’t find the empirical comparison entirely fair.
- Table 1: It presents Pass@1 for independent reasoning, self-correct prompt, and LeaP. However, each LeaP path benefits from multiple peer paths, so there is no well-defined Pass@1 for LeaP. A more reasonable evaluation, in my view, would be: assuming the width of each LeaP inference is N, run M LeaP inferences and M×N independent inferences, and then report Pass@N for a fair comparison.
- Table 3:
  - Following from the previous point, I
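One possible reading of the Pass@N protocol suggested for Table 1 is sketched below. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for the M×N independent samples, and a per-group "any path correct" rate for the M LeaP inferences. The function names and the treatment of each LeaP inference as one indivisible group are assumptions made for illustration; the review itself does not specify an estimator.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n independent samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def grouped_pass_at_n(groups: List[List[bool]]) -> float:
    """Fraction of LeaP inferences (groups of N coupled paths) in which
    at least one path is correct. Paths inside a group are not
    independent, so each group is scored as a whole."""
    return sum(any(g) for g in groups) / len(groups)


# Independent baseline: M*N = 4 samples, 1 correct, report Pass@N with N = 2.
independent_score = pass_at_k(n=4, c=1, k=2)

# LeaP: M = 2 inferences of width N = 2; only the first has a correct path.
leap_score = grouped_pass_at_n([[True, False], [False, False]])
```

Both quantities are per-problem scores; averaging them over a benchmark would put the two settings on the same sample budget.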

Code & Models

Models

Datasets

Learning-from-Peers/DeepSeek-R1-Distill-Qwen-32B-LeaP
dataset· 25 dl

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks · Multimodal Machine Learning Applications