Reward Model Generalization for Compute-Aware Test-Time Reasoning

Zeen Song; Wenwen Qiang; Siyu Zhao; Changwen Zheng; Gang Hua

arXiv:2505.18065·cs.LG·May 26, 2025

3 Reviews

TL;DR

This paper introduces a theoretical framework analyzing how the generalization error of reward models impacts compute efficiency and reasoning accuracy in test-time reasoning for large language models, and proposes a dynamic search method called CATS.

Contribution

It provides a PAC-Bayes-based theoretical analysis of test-time reward model generalization and introduces CATS, a compute-aware search method that improves reasoning performance under compute constraints.

Findings

01

CATS outperforms existing methods on MATH and AIME benchmarks.

02

Lower reward model generalization error reduces sample requirements for correct answers.

03

Theoretical bounds link reward model error to compute efficiency.

Abstract

External test-time reasoning enhances large language models (LLMs) by decoupling generation and selection. At inference time, the model generates multiple reasoning paths, and an auxiliary process reward model (PRM) is used to score and select the best one. A central challenge in this setting is test-time compute optimality (TCO), i.e., how to maximize answer accuracy under a fixed inference budget. In this work, we establish a theoretical framework to analyze how the generalization error of the PRM affects compute efficiency and reasoning performance. Leveraging PAC-Bayes theory, we derive generalization bounds and show that a lower generalization error of PRM leads to fewer samples required to find correct answers. Motivated by this analysis, we propose Compute-Aware Tree Search (CATS), an actor-critic framework that dynamically controls search behavior. The actor outputs sampling…
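The generate-then-select loop described in the abstract can be illustrated with a toy best-of-N simulation. This is a minimal sketch under stated assumptions, not the paper's CATS method or its PAC-Bayes analysis: the names `best_of_n` and `simulate_accuracy` are hypothetical, each sampled path is assumed correct independently with probability `p_correct`, and reward-model error is stood in for by Gaussian noise of standard deviation `noise_std` added to the true correctness.

```python
import random


def best_of_n(candidates, scores):
    """Return the candidate whose reward-model score is highest."""
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]


def simulate_accuracy(n_samples, p_correct, noise_std, trials=2000, seed=0):
    """Estimate best-of-N accuracy under a toy noisy reward model.

    Assumptions (not from the paper): each sampled path is correct
    independently with probability p_correct, and the reward model sees
    the true correctness (1.0 or 0.0) plus Gaussian noise of std noise_std.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # True means the sampled reasoning path reaches the correct answer.
        candidates = [rng.random() < p_correct for _ in range(n_samples)]
        # Noisy scores play the role of an imperfect PRM.
        scores = [float(c) + rng.gauss(0.0, noise_std) for c in candidates]
        hits += best_of_n(candidates, scores)
    return hits / trials


if __name__ == "__main__":
    # Rows: reward-model noise level; columns: sample budget N = 1, 4, 16.
    for noise_std in (0.1, 1.0, 2.0):
        row = [simulate_accuracy(n, 0.2, noise_std) for n in (1, 4, 16)]
        print(noise_std, [round(acc, 3) for acc in row])
```

In this toy setting, a noisier scorer needs a larger sample budget to reach the same accuracy, which is the qualitative relationship the abstract attributes to PRM generalization error.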

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 6·Confidence 2

Strengths

- The paper provides a clear theoretical analysis of how the generalization error of the PRM affects compute efficiency and reasoning performance.
- Building on the above theoretical findings, the paper designs CATS, and demonstrates effective downstream performance consistent with the theory.

Weaknesses

- The baseline comparison is not sufficiently comprehensive: the paper does not include comparisons with several recent and relevant test-time scaling methods such as DVTS [1], REBASE [2], and DORA [3], which also study compute allocation and inference efficiency. Without these, it is difficult to gauge how much of the observed improvement comes from the proposed controller versus general dynamic inference strategies.
- The theoretical results are based on a set of reasonable assumptions, whic

Reviewer 02·Rating 4·Confidence 4

Strengths

**Originality** The paper offers a theoretically grounded perspective on inference-time reasoning that connects reward gap geometry with sampling decisions. Positioning adaptive rollout allocation as a policy-optimization problem is an interesting and under-explored angle. The PAC-Bayes analysis provides conceptual clarity on when larger rollout budgets are beneficial and yields intuitive design implications.

**Quality** The method is clearly described, and experiments across multiple datasets,

Weaknesses

**1. Strength of empirical baselines.** While the paper includes common external TTS strategies, recent strong baselines such as DVTS [1] and REBASE [2] have been shown to be competitive under similar compute constraints. Without comparisons to these methods, it is difficult to judge the relative advantage of CATS. The gains reported here are modest in some configurations, especially when the underlying policy model is relatively strong, raising questions about the robustness of improvements.

Reviewer 03·Rating 2·Confidence 3

Strengths

- there are many theoretical results
- graphs are clear and get the point across

Weaknesses

1. the performance gain over beam search seems quite marginal
2. the lower bound on the probability that the policy model generates a correct answer seems quite trivial; I don't think this can qualify as a contribution
3. if you're doing this strategy, then at each action step, you would need to stop and calculate the next best action to choose and subsequently send in a new query, correct? In practice, is this really feasible? Since sending a new query may lead to scheduling, KV cache, and synch
