Test-Time Scaling with Reflective Generative Model

Zixiao Wang; Yuxin Wang; Xiaorui Wang; Mengting Xing; Jie Gao; Jianjun Xu; Guangcan Liu; Chenhui Jin; Zhuo Wang; Shengzhuo Zhang; Hongtao Xie

arXiv:2507.01951·cs.LG·July 10, 2025

Open Access · 4 Models · 3 Reviews

TL;DR

This paper introduces MetaStone-S1, a reflective generative model that reaches OpenAI o3-mini-level reasoning performance at 32B parameters. It shares one backbone between the policy and a self-supervised process reward model, which enables test-time scaling and three reasoning effort modes.

Contribution

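The paper presents the Reflective Generative Form, which unifies the policy and the process reward model on a shared backbone, and a self-supervised process reward model trained from outcome rewards. Together these remove the reliance on process-level annotations and enable controllable reasoning effort modes.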

Findings

01. MetaStone-S1 matches OpenAI o3-mini performance with 32B parameters.

02. Introduces a unified interface for policy and reward models.

03. Provides three controllable reasoning effort modes.

Abstract

We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking…
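The trajectory selection described in the abstract can be illustrated with a minimal best-of-N sketch. It is not the authors' implementation: the `"\n\n"` step delimiter is taken from the reviewer comments, while the geometric-mean aggregation, the function names, and `toy_step_scorer` (a stand-in for the scoring head on the shared backbone) are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence

# Step delimiter mentioned in the reviews; treated as an assumption here.
STEP_DELIMITER = "\n\n"


def split_steps(trajectory: str) -> List[str]:
    """Split a reasoning trajectory into non-empty steps."""
    return [s for s in trajectory.split(STEP_DELIMITER) if s.strip()]


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores in [0, 1] with a geometric mean (assumed)."""
    if not step_scores:
        return 0.0
    eps = 1e-12  # avoids log(0) for steps scored exactly zero
    log_sum = sum(math.log(max(s, eps)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(
    candidates: Sequence[str],
    score_step: Callable[[str], float],
) -> str:
    """Return the candidate trajectory with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    scored = [
        (trajectory_score([score_step(s) for s in split_steps(c)]), c)
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[0])[1]


def toy_step_scorer(step: str) -> float:
    """Hypothetical scorer: distrusts steps that admit to guessing."""
    return 0.2 if "guess" in step.lower() else 0.9
```

In this sketch a reasoning effort mode would correspond to how many candidates are sampled before calling `select_best`; the mapping from mode to candidate count is not given in the text above and is left to the caller.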

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Strong empirical validation: The experiments are extensive and include multiple benchmarks and LLMs.

Weaknesses

- SPRM is a filtering mechanism, not self-supervised: The model filters step-level data via a binary weight that retains only steps consistent with the outcome.
- Limited novelty: The shared backbone between policy and reward model is an engineering optimization rather than a conceptual advance in test-time scaling. Moreover, the paper does not study whether shared parameters introduce bias.
- Terminology and clarity issues: The formulation of LLMs lacks rigor (e.g., “basic LLM”)

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- Originality: The paper proposes a highly original idea of using the same backbone for policy and reward models. This idea is a novel and exciting extension of prior reward models, which are typically large and separately trained. This idea opens up a lot of exciting directions for enabling richer interactions between reasoning trajectory generation and evaluation.
- Quality: The proposed framework is clearly defined and well-motivated. The experimental evaluation is comprehensive, including m

Weaknesses

1. The design of the self-supervised process reward loss (SPR loss) could benefit from additional motivation and clarification. Specifically, the binary weight w_n only includes a step in the loss when the predicted step score aligns with the final outcome. Why choose a hard threshold (0.5) vs other alternatives? Could such a hard cutoff potentially discard a large fraction of training samples, particularly early in training? And could this selective inclusion behavior relate to later observatio
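The hard-threshold behavior this reviewer asks about can be made concrete with a small sketch. It assumes the loss is a binary cross-entropy between each step score and the outcome label, masked by the binary weight w_n and averaged over all steps; the reviews only state the masking rule and the 0.5 threshold, so the exact loss form, the normalization, and the name `spr_loss` are assumptions.

```python
import math
from typing import Sequence, Tuple


def spr_loss(
    step_scores: Sequence[float],
    outcome_correct: bool,
    threshold: float = 0.5,
) -> Tuple[float, int]:
    """Masked BCE over step scores; returns (loss, number of steps kept).

    A step contributes only when its thresholded prediction agrees with
    the final outcome (w_n = 1); otherwise it is dropped (w_n = 0).
    """
    if not step_scores:
        return 0.0, 0
    y = 1.0 if outcome_correct else 0.0
    eps = 1e-12  # keeps log() finite for scores at exactly 0 or 1
    total, kept = 0.0, 0
    for score in step_scores:
        predicted_correct = score > threshold
        if predicted_correct != outcome_correct:
            continue  # w_n = 0: step disagrees with the outcome label
        s = min(max(score, eps), 1.0 - eps)
        total += -(y * math.log(s) + (1.0 - y) * math.log(1.0 - s))
        kept += 1
    # Assumed normalization: average over all steps, not only kept ones.
    return total / len(step_scores), kept
```

Returning the kept count alongside the loss makes it easy to measure the fraction of steps discarded, which is the quantity the reviewer worries could be large early in training.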

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- By sharing the backbone network, the proposed method reduces the inference cost of using the PRM to evaluate policy rollouts.
- Experimental results show that the proposed SPRM achieves superior performance with the addition of fewer parameters.
- The proposed method is simple and seems to be effective.

Weaknesses

- Missing related work on process reward models. Several studies [1-4] also incorporate outcome labels to train a process reward model, which is highly relevant to this paper.
- Other work [5] has introduced '\n' as a step token. What is the rationale and benefit behind selecting '\n\n' instead? A concern is that if the policy model does not generate '\n\n', how would this method remain applicable?
- Regarding line 219, I have a concern about the clarification: "Since the representation in the

Taxonomy

Topics: AI-based Problem Solving and Planning · Explainable Artificial Intelligence (XAI) · Business Process Modeling and Analysis