Learning to Reason Across Parallel Samples for LLM Reasoning
Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi

TL;DR
This paper introduces a compact LLM called Sample Set Aggregator (SSA) that learns to effectively combine multiple reasoning samples to improve accuracy in large language model reasoning tasks, outperforming traditional aggregation methods.
Contribution
The paper presents SSA, a novel reinforcement learning-trained model that aggregates multiple samples for reasoning, enhancing performance and efficiency over naive methods and larger models.
Findings
SSA improves pass@5 by 8% over majority voting on MATH.
SSA surpasses larger model-based re-ranking methods.
SSA generalizes well across datasets, sample sizes, and model scales.
Abstract
Scaling test-time compute brings substantial performance gains for large language models (LLMs). By sampling multiple answers and heuristically aggregating them (e.g., through majority voting or by using verifiers to rank the answers), one can achieve consistent performance gains in math domains. In this paper, we propose a new way to leverage such sets of multiple samples. We train a compact LLM, called Sample Set Aggregator (SSA), that takes a concatenated sequence of multiple samples and outputs the final answer, optimizing it for answer accuracy with reinforcement learning. Experiments on five reasoning datasets demonstrate both the efficacy and efficiency of SSA. Notably, SSA improves over naive majority voting by 8% pass@5 on MATH. Furthermore, our 3B SSA surpasses model-based re-ranking with a much larger 72B process reward model. Our analysis also shows promising…
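The pipeline the abstract describes can be sketched in a few lines: a majority-voting baseline over sampled answers, the concatenation of samples into a single input for an aggregator model, and a binary accuracy reward of the kind an RL objective could optimize. This is a minimal illustration, not the authors' implementation. The function names, the prompt template, and the exact-match reward are all assumptions introduced here.

```python
from collections import Counter


def majority_vote(answers):
    """Baseline aggregation: return the most frequent answer.

    Ties resolve to the answer seen first, because Counter.most_common
    keeps equal counts in first-encountered order.
    """
    return Counter(answers).most_common(1)[0][0]


def build_aggregator_input(question, samples):
    """Concatenate the question and all sampled solutions into one sequence.

    The template below is hypothetical; the paper's actual prompt format
    is not given in this summary.
    """
    parts = [f"Question: {question}"]
    for i, sample in enumerate(samples, start=1):
        parts.append(f"Sample {i}: {sample}")
    parts.append("Final answer:")
    return "\n\n".join(parts)


def accuracy_reward(predicted, gold):
    """Binary reward for answer accuracy (assumed here to be exact match)."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


if __name__ == "__main__":
    sampled_answers = ["4", "5", "4"]
    print(majority_vote(sampled_answers))
    print(build_aggregator_input("What is 2 + 2?", ["2 + 2 = 4", "2 + 2 = 5"]))
    print(accuracy_reward(" 4 ", "4"))
```

Unlike majority voting, which only counts final answers, an aggregator that reads the concatenated input can in principle weigh the reasoning inside each sample before committing to an answer.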
Peer Reviews
Decision · Submitted to ICLR 2026
- Simple but very clean problem formulation. The lightweight test-time scaling combines the strengths of parallel sampling (cheap) with the ability of sequential methods to reason across solutions.
- Comparison of training algorithms (SFT vs. RL) for training the SSA model.
- Evaluation showing generalization beyond training, to responses from a different model.
- The comparison with the reranker model, which also reasons on top of parallel samples of LLM responses, is not quite clear to me.
- The paper primarily focuses on smaller/lightweight models. It would be ideal to understand whether SSA model performance tops out at sequential reasoning, by ablating SSA model size and the SFT reference model responses (e.g., using a larger reasoning model). This is particularly interesting especially that smaller model may not be able to leverage large reasoning mode
- The core idea is simple yet effective, particularly for in-domain math reasoning tasks.
- Experiments across 5 datasets demonstrate that the RL-trained, smaller SSA model can outperform larger ORM and PRM models on math reasoning benchmarks.
- The method is conceptually straightforward and lacks strong technical novelty, and the empirical gains are not substantial enough to offset this limitation.
- Improvements are mostly limited to math reasoning, where prior ORM and PRM methods have shown broader generalization across diverse domains.
- As shown in Figure 2, performance does not generalize well when increasing the number of input samples, and in Table 2, the results using Llama to produce answer input are also only marginally b
- The paper tackles an important problem of test-time scaling and how search/optimization/aggregation methods could be used to improve model performance, including when the base model is a blackbox/API-only model.
- The paper is relatively well written, with clear structure and presentation.
- The paper presents empirical results indicating that their proposed SSA model could match the performance or even outperform larger PRMs, including a 3B SSA model matching/outperforming the Qwen 72B PRM mod
- The paper should include more detailed comparisons and discussions with the LLM ensemble literature, where a base model can be queried in parallel with various reasoning/role prompts and the responses subsequently aggregated [1-3]. These methods are similar in that K parallel candidates are also drawn from a frozen base model before being aggregated, with the aggregator possibly also being an LLM fed with the various candidate responses.
- The paper's claim in proposing a m
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Balanced Selection · Sparse Evolutionary Training
