Learning to Reason Across Parallel Samples for LLM Reasoning

Jianing Qi; Xi Ye; Hao Tang; Zhigang Zhu; Eunsol Choi

arXiv:2506.09014·cs.CL·October 13, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a compact LLM called Sample Set Aggregator (SSA) that learns to effectively combine multiple reasoning samples to improve accuracy in large language model reasoning tasks, outperforming traditional aggregation methods.

Contribution

The paper presents SSA, a novel reinforcement learning-trained model that aggregates multiple samples for reasoning, enhancing performance and efficiency over naive methods and larger models.

Findings

01

SSA improves pass@5 by 8% over majority voting on MATH.

02

SSA surpasses larger model-based re-ranking methods.

03

SSA generalizes well across datasets, sample sizes, and model scales.
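The two baselines named in the findings, majority voting and score-based re-ranking, can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the toy answers, and the verifier scores are all hypothetical, and a real re-ranker would obtain its scores from a reward model rather than a hard-coded list.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer.

    Ties go to the answer that appeared first, because Counter keeps
    insertion order and max() returns the first maximal element.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    return max(counts, key=counts.get)


def rerank_best(answers, scores):
    """Return the answer of the single highest-scored sample.

    `scores` stands in for verifier / reward-model outputs (hypothetical).
    """
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and aligned")
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Toy example: five sampled final answers and made-up verifier scores.
answers = ["12", "15", "12", "7", "12"]
scores = [0.2, 0.9, 0.3, 0.1, 0.4]

voted = majority_vote(answers)        # picks the most common answer
reranked = rerank_best(answers, scores)  # picks the top-scored sample
```

On this toy input the two heuristics disagree: voting returns the most common answer, while re-ranking returns the answer of the top-scored sample. Both can only select among answers already present in the sample set.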

Abstract

Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or by using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such sample sets. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on five reasoning datasets demonstrate both the efficacy and efficiency of SSA. Notably, SSA improves over naive majority voting by 8% pass@5 on MATH. Furthermore, our 3B SSA surpasses model-based re-ranking with a much larger 72B process reward model. Our analysis also shows promising…
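The abstract describes an aggregator that reads a concatenated sequence of samples and is optimized for answer accuracy. A rough sketch of those two pieces is given below, under stated assumptions: the prompt template, the `Answer:` output convention, and the exact-match reward are hypothetical choices made for illustration, and the abstract does not specify the actual input format or the reinforcement learning algorithm.

```python
import re


def build_aggregator_input(question, samples):
    """Concatenate the question and K sampled solutions into one prompt.

    The "Question:" / "Sample i:" template is an assumption, not the
    paper's format.
    """
    parts = [f"Question: {question}"]
    for index, sample in enumerate(samples, start=1):
        parts.append(f"Sample {index}:\n{sample.strip()}")
    parts.append("Final answer:")
    return "\n\n".join(parts)


def extract_answer(output):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", output)
    return matches[-1].strip() if matches else None


def accuracy_reward(output, gold):
    """Binary reward: 1.0 if the extracted answer equals the gold answer."""
    predicted = extract_answer(output)
    return 1.0 if predicted is not None and predicted == gold.strip() else 0.0


# Toy usage with two hypothetical samples for one question.
prompt = build_aggregator_input(
    "What is 2 + 3?",
    ["2 + 3 = 5. Answer: 5", "I get 6. Answer: 6"],
)
reward = accuracy_reward("The first sample is right. Answer: 5", "5")
```

A reinforcement learning loop would then sample outputs from the aggregator given `prompt` and use a reward like `accuracy_reward` as the training signal; the policy-update step is omitted here because the abstract does not describe it.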

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Simple but very clean problem formulation. The lightweight test-time scaling combines the strengths of parallel sampling (cheap) with the ability of sequential methods to reason across solutions.
- Comparison of training algorithms (SFT vs. RL) for training the SSA model.
- Evaluation showing generalization beyond training, to responses from a different model.

Weaknesses

- The comparison with the reranker model, which also reasons over the parallel samples from the LLM, is not quite clear to me.
- The paper primarily focuses on smaller/lightweight models. It would be ideal to understand whether SSA performance tops out at sequential reasoning, through ablations on SSA model size and SFT reference model responses (e.g., using a larger reasoning model). This is particularly interesting given that a smaller model may not be able to leverage large reasoning mode…

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The core idea is simple yet effective, particularly for in-domain math reasoning tasks.
- Experiments across 5 datasets demonstrate that the RL-trained, smaller SSA model can outperform larger ORM and PRM models on math reasoning benchmarks.

Weaknesses

- The method is conceptually straightforward and lacks strong technical novelty, and the empirical gains are not substantial enough to offset this limitation.
- Improvements are mostly limited to math reasoning, where prior ORM and PRM methods have shown broader generalization across diverse domains.
- As shown in Figure 2, performance does not generalize well when increasing the number of input samples, and in Table 2, the results using Llama to produce answer input are also only marginally b…

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper tackles an important problem of test-time scaling and how search/optimization/aggregation methods could be used to improve model performance, including when the base model is a black-box/API-only model.
- The paper is relatively well written, with clear structure and presentation.
- The paper presents empirical results indicating that the proposed SSA model could match or even outperform larger PRMs, including a 3B SSA model matching/outperforming the Qwen 72B PRM mod…

Weaknesses

- The paper should include more detailed comparisons and discussions with the LLM ensemble literature, where a base model could be queried in parallel with various different reasoning/role prompts and the responses subsequently aggregated [1-3]. These methods bear similarity in that K parallel candidates are also drawn from a frozen base model before being aggregated, with the aggregator possibly also being an LLM fed with the various candidate responses.
- The paper's claim in proposing a m…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Balanced Selection · Sparse Evolutionary Training