CTTS: Collective Test-Time Scaling

Zhende Song; Shengji Tang; Peng Ye; Jiayuan Fan; Lei Bai; Tao Chen; Wanli Ouyang

arXiv:2508.03333 · cs.CL · September 30, 2025


Open Access · 4 Reviews

TL;DR

This paper introduces Collective Test-Time Scaling (CTTS), a novel multi-agent, multi-reward framework that significantly improves large language model performance by leveraging collaboration among multiple models and reward systems.

Contribution

The paper systematically analyzes interaction paradigms and proposes CTTS-MM, a new framework with agent collaboration search and reward model ensemble strategies, surpassing existing single-agent methods.

Findings

01. MA-MR paradigm outperforms other interaction types

02. CTTS-MM achieves +4.82% over Best-of-N

03. CTTS-MM surpasses GPT-4.1 and open-source LLMs

Abstract

Test-time scaling (TTS) has emerged as a promising, training-free approach for enhancing large language model (LLM) performance. However, the efficacy of existing methods, such as Best-of-N and Self-Consistency, is fundamentally constrained by the dominant single test-time scaling (STTS) paradigm, which relies on a single LLM agent interacting with a single reward model (SA-SR). Inspired by recent work showing that collective methods can surpass the performance ceiling of individual models, we introduce Collective Test-Time Scaling (CTTS). First, we systematically investigate three primary interaction paradigms of existing multiple models: single-agent-multi-reward (SA-MR), multi-agent-single-reward (MA-SR), and multi-agent-multi-reward (MA-MR). Extensive experiments reveal that the MA-MR paradigm is consistently superior. Based on this finding, we further propose CTTS-MM, a novel…
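The paradigms named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's CTTS-MM implementation: the function names, the `Agent` and `RewardModel` type aliases, and the use of a uniform average to combine reward models are all assumptions made for illustration. The sketch only contrasts single-agent selection (Best-of-N, Self-Consistency) with a pooled multi-agent, multi-reward (MA-MR) selection.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not taken from the paper):
Agent = Callable[[str], str]                # prompt -> candidate answer
RewardModel = Callable[[str, str], float]   # (prompt, answer) -> score


def best_of_n(prompt: str, agent: Agent, reward: RewardModel, n: int) -> str:
    """SA-SR: sample n answers from one agent; keep the one that a
    single reward model scores highest."""
    candidates = [agent(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))


def self_consistency(prompt: str, agent: Agent, n: int) -> str:
    """Single agent, no reward model: return the most frequent answer."""
    candidates = [agent(prompt) for _ in range(n)]
    return Counter(candidates).most_common(1)[0][0]


def ma_mr_select(
    prompt: str,
    agents: Sequence[Agent],
    rewards: Sequence[RewardModel],
    n_per_agent: int,
) -> str:
    """MA-MR: pool candidates from several agents and rank them by the
    mean score of several reward models.

    Uniform averaging is an assumption chosen for simplicity; it is not
    claimed to be the ensemble strategy used by CTTS-MM.
    """
    candidates = [
        agent(prompt) for agent in agents for _ in range(n_per_agent)
    ]

    def mean_reward(answer: str) -> float:
        return sum(r(prompt, answer) for r in rewards) / len(rewards)

    return max(candidates, key=mean_reward)
```

In this sketch the three functions differ only in where candidates come from and how they are ranked, which is the axis along which the SA-SR, SA-MR, MA-SR, and MA-MR paradigms vary.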

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- **Originality**: First work to formalize and systematically explore CTTS paradigms.
- **Quality**: Rigorous experimentation across seven diverse benchmarks, with clear performance gains.
- **Clarity**: Well-organized methodology and ablation studies.
- **Significance**: Demonstrates that collective scaling can surpass flagship proprietary models using only open-source components, highlighting a promising research direction.

Weaknesses

- **Experimental Breadth**: While seven benchmarks are used, they primarily focus on reasoning and coding tasks; inclusion of more diverse NLP tasks (e.g., dialogue, summarization) would strengthen generalizability claims.
- **Technical Details**: Some algorithmic details are vague (e.g., the exact implementation of the aggregator in ACS, hyperparameter sensitivity analysis).
- **Computational Cost**: Although inference time is discussed, a deeper analysis of the computational overhead of ACS an…

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. The paper clearly formalizes collective test-time scaling and provides a systematic comparison of different collaboration paradigms.
2. The proposed CTTS-MM integrates model search and reward aggregation in a coherent and well-motivated framework.
3. Experiments are extensive, covering seven benchmarks, ten models, and eight reward models, with consistent and strong results.
4. The ablation and efficiency analyses are detailed and help demonstrate the contribution of each component.

Weaknesses

1. The paper lacks theoretical explanation of why multi-agent–multi-reward scaling leads to consistent improvements.
2. The robustness of the framework to inaccurate or biased reward models is not studied.
3. The ablations do not include simple baselines such as random selection or uniform reward weighting.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper in general is well-presented, and easy to follow. The research problem has been well-articulated to the reader.
2. The paper offers a comprehensive analysis of the performance across various configurations, including single-agent single-reward, single-agent multi-reward, and multi-agent multi-reward test-time scaling approaches.
3. The experimental evaluation is comprehensive, covering diverse benchmarks such as mathematical reasoning, coding, knowledge-based question answering,…

Weaknesses

1. The primary concern lies in the limited technical novelty of the proposed method. Based on the established ensembling literature [1–5], it is intuitive to assume that leveraging multiple agents would naturally enhance performance. Moreover, even in the context of test-time scaling, recent studies have also validated this assumption [5,6]. In addition, the proposed agent-collaborative framework essentially performs a greedy search, and the reward model selection relies on dot-product similarit…

Reviewer 04 · Rating 2 · Confidence 5

Strengths

1. The paper focuses on the timely and popular topic of TTS and extends it by considering multi-model ensembling. Empirical pre-test results show that incorporating multiple models and verifiers brings improvements in TTS performance, indicating practical significance.
2. The proposed empirical insights are exciting: under appropriate strategies, multi-model and multi-verifier settings outperform single-model or single-verifier baselines. These findings lay a foundation for further improving TT…

Weaknesses

1. The overall presentation and clarity require improvement.
   + (1) Key concepts are insufficiently defined and wrongly used. The paper restricts the definition of TTS to the “agent” setting, whereas most prior work considers LLMs more generally without limiting to agent-based contexts. Furthermore, it describes TTS as a two-stage process, parallel answer generation followed by selection, which only corresponds to parallel TTS and omits other paradigms such as tree search and self-refinement. The…


Taxonomy

Topics: Medical Imaging Techniques and Applications · Advanced MRI Techniques and Applications