On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference

Yue Yu; Qiwei Di; Quanquan Gu; Dongruo Zhou

arXiv:2512.04558·cs.LG·December 5, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates the fundamental limits of test-time compute in large language models, proposing reward-filtered sequential inference as a more effective method that improves performance by focusing on high-reward outputs.

Contribution

It introduces reward-filtered sequential inference, a novel approach that surpasses standard test-time compute methods both theoretically and empirically.

Findings

01

Reward-filtered inference provides stronger theoretical guarantees.

02

Empirical results show consistent performance improvements.

03

The method effectively concentrates computation on high-quality outputs.

Abstract

Test-time compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs). Despite the empirical success of methods such as best-of-$n$ (BoN) sampling and sequential revision, their fundamental limits remain unclear. We address this gap by analyzing a mixture-of-reference policy model and proving that standard BoN is inherently suboptimal. To move closer to the optimal frontier, we study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context. This mechanism concentrates computation on superior policy candidates and suppresses inferior ones. On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms. On the empirical side, we evaluate such an inference strategy across diverse benchmarks and…
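As a rough illustration of the two procedures the abstract contrasts, the sketch below puts standard BoN next to one possible reading of reward-filtered sequential inference. It is not the paper's algorithm: the threshold rule, the way the context is extended, and the names `best_of_n`, `reward_filtered_sequential`, `toy_generate`, and `toy_reward` are all assumptions made for this example, and the toy generator and reward stand in for an LLM and a reward model.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces, for illustration only (not from the paper).
Generate = Callable[[str, List[str], random.Random], str]
Reward = Callable[[str, str], float]


def best_of_n(prompt: str, generate: Generate, reward: Reward,
              n: int, rng: random.Random) -> str:
    """Standard BoN: n independent draws from the same empty context."""
    candidates = [generate(prompt, [], rng) for _ in range(n)]
    return max(candidates, key=lambda a: reward(prompt, a))


def reward_filtered_sequential(prompt: str, generate: Generate, reward: Reward,
                               n: int, threshold: float,
                               rng: random.Random) -> Tuple[str, List[str]]:
    """Sequential inference where only high-reward generations enter the context.

    Assumption: "high-reward" means score >= threshold. Low-reward
    generations are scored but never shown to later generation steps.
    """
    context: List[str] = []
    best, best_score = "", float("-inf")
    for _ in range(n):
        answer = generate(prompt, context, rng)
        score = reward(prompt, answer)
        if score > best_score:
            best, best_score = answer, score
        if score >= threshold:
            context.append(answer)  # the filter step
    return best, context


def toy_generate(prompt: str, context: List[str], rng: random.Random) -> str:
    """Toy stand-in for an LLM: guesses an integer, refining the last kept guess."""
    if not context:
        return str(rng.randint(0, 100))
    return str(int(context[-1]) + rng.randint(-5, 5))


def toy_reward(prompt: str, answer: str) -> float:
    """Toy stand-in for a reward model: closeness to a hidden target of 42."""
    return float(-abs(int(answer) - 42))


if __name__ == "__main__":
    rng = random.Random(0)
    print("BoN:", best_of_n("q", toy_generate, toy_reward, 16, rng))
    print("Filtered:", reward_filtered_sequential(
        "q", toy_generate, toy_reward, 16, -10.0, rng))
```

In this toy setup both procedures spend the same budget of `n` generations. The only difference is that the sequential variant conditions later draws on earlier answers that passed the filter, which is the mechanism the abstract describes as concentrating computation on superior candidates.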

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

* The argument that in sequential settings we can leverage additional information is interesting and compelling.
* The authors propose a realistic mixture-of-topics setting where BoN can fail to achieve low regret, and show that their proposed algorithm can achieve low regret.
* The empirical results demonstrate the efficacy of reward-filtered sequential BoN.

Weaknesses

* SBoN is already the established abbreviation for Soft Best-of-N [1], and the community already uses it that way [2,3]. I would recommend choosing a different abbreviation and renaming Algorithm 1, for example to SeqBoN.
* Also, Algorithm 1 is very vague. What is "Update h based on $x, a_1, \ldots, a_{i-1}$"? Can you describe a specific function or algorithm? For instance, what if I update $h$ to be $x$? Then this is just BoN, but falls under your description of SBoN. I'm not trying to make you say explicitly

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper studies an interesting question about sequential test-time compute, generalizing the results of Huang et al. for Best-of-N. The paper makes many assumptions and derives sensible results under them.

Weaknesses

* I would like to challenge the main assumptions in the paper relating to the mixture assumption. Reading through the appendix, models are instructed as follows: "The previous solution(s) may contain errors. Before solving, briefly critique the previous attempt(s) in 2 to 3 bullet points. Then provide a COMPLETE and CONCISE corrected solution from scratch that addresses those issues. End with exactly one line containing the final answer". Is it really that the trajectory is singling out a singl

Reviewer 03 · Rating 4 · Confidence 3

Strengths

Under certain assumptions on the LLM training data, the authors evaluate different TTC methods and show, for example, that vanilla BoN is suboptimal. The experimental results show that the proposed method outperforms vanilla BoN and a simple sequential baseline in terms of accuracy as a function of $N$.

Weaknesses

### Presentation

I find the paper somewhat difficult to follow. I'm not sure what the best way to present things is, but I feel the paper could be improved by a clearer presentation, including a more precise problem statement, clearer definitions, and explicit goals throughout the paper and its sections (e.g., a clearer distinction between "showing that BoN under these assumptions on the pretraining data is suboptimal" and "deriving a new TTC method for sequential TTC").

### Theory

The assumptions

Taxonomy

Topics: Topic Modeling · Machine Learning and Algorithms · Natural Language Processing Techniques