Test-Time Scaling with Reflective Generative Model
Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie

TL;DR
This paper introduces MetaStone-S1, a reflective generative model in which the policy and the process reward model share one backbone, and the reward head is trained in a self-supervised way from outcome rewards. This keeps the added parameter count small and enables test-time scaling with three reasoning effort modes.
Contribution
The paper presents a new reflective generative form and a self-supervised process reward model, reducing reliance on process-level annotations and enabling scalable reasoning modes.
Findings
MetaStone-S1 matches OpenAI o3-mini performance with 32B parameters.
Introduces a unified interface for policy and reward models.
Provides three controllable reasoning effort modes.
Abstract
We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking…
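The trajectory selection described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the per-step scores are assumed to come from the scoring head, the geometric-mean aggregation is an assumption, and the candidate counts in EFFORT_TO_CANDIDATES are hypothetical values chosen only to show how an effort mode could map to more sampled trajectories.

```python
import math

# Hypothetical mapping from reasoning effort mode to the number of sampled
# trajectories; the actual values are not given in the text above.
EFFORT_TO_CANDIDATES = {"low": 2, "medium": 8, "high": 32}


def trajectory_score(step_scores):
    """Aggregate per-step scores in (0, 1] into one trajectory score.

    Uses a geometric mean, which is an assumption made for this sketch.
    """
    if not step_scores:
        return 0.0
    # Clamp to avoid log(0) if a step score is exactly zero.
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_trajectory(candidates):
    """Return the (text, step_scores) pair with the highest aggregated score."""
    return max(candidates, key=lambda c: trajectory_score(c[1]))


def sample_candidates(mode, sampler):
    """Draw as many candidates as the effort mode allows.

    `sampler` is a hypothetical callable that returns one
    (text, step_scores) pair per call.
    """
    return [sampler(i) for i in range(EFFORT_TO_CANDIDATES[mode])]
```

With candidates such as ("a", [0.9, 0.9]) and ("b", [0.99, 0.1]), select_trajectory prefers "a", because a single low-scoring step pulls the geometric mean of "b" down sharply.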
Peer Reviews
Decision: ICLR 2026 Poster
Strong empirical validation: The experiments are extensive and include multiple benchmarks and LLMs.

- SPRM is a filtering mechanism, not self-supervised: The model filters step-level data via a binary weight that retains only steps consistent with the outcome.
- Limited novelty: The shared backbone between policy and reward model is an engineering optimization rather than a conceptual advance in test-time scaling. Moreover, the paper does not study whether shared parameters introduce bias.
- Terminology and clarity issues: The formulation of LLMs lacks rigor (e.g., “basic LLM”)

- Originality: The paper proposes a highly original idea of using the same backbone for policy and reward models. This idea is a novel and exciting extension of prior reward models, which are typically large and separately trained. This idea opens up a lot of exciting directions for enabling richer interactions between reasoning trajectory generation and evaluation.
- Quality: The proposed framework is clearly defined and well-motivated. The experimental evaluation is comprehensive, including m

1. The design of the self-supervised process reward loss (SPR loss) could benefit from additional motivation and clarification. Specifically, the binary weight w_n only includes a step in the loss when the predicted step score aligns with the final outcome. Why choose a hard threshold (0.5) vs other alternatives? Could such a hard cutoff potentially discard a large fraction of training samples, particularly early in training? And could this selective inclusion behavior relate to later observatio

- By sharing the backbone network, the proposed method reduces the inference cost of using the PRM to evaluate policy rollouts.
- Experimental results show that the proposed SPRM achieves superior performance with the addition of fewer parameters.
- The proposed method is simple and seems to be effective.

- Missing related work on process reward models. Several studies [1-4] also incorporate outcome labels to train a process reward model, which is highly relevant to this paper.
- Other work [5] has introduced '\n' as a step token. What is the rationale and benefit behind selecting '\n\n' instead? A concern is that if the policy model does not generate '\n\n', how would this method remain applicable?
- Regarding line 219, I have a concern about the clarification: "Since the representation in the
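
The binary weight w_n that the reviews discuss can be illustrated with a small sketch. It assumes a binary cross-entropy term per step against the final outcome label, averaged over all steps. The function name spr_loss, the averaging over the step count, and the handling of a score of exactly 0.5 are assumptions made for illustration, not the authors' implementation.

```python
import math


def spr_loss(step_scores, outcome_correct, threshold=0.5):
    """Outcome-supervised step loss with a hard agreement mask (a sketch).

    step_scores: predicted per-step scores in [0, 1].
    outcome_correct: True if the final answer of the trajectory is correct.
    Returns (loss, kept), where kept is the number of steps with w_n = 1.
    """
    y = 1.0 if outcome_correct else 0.0
    eps = 1e-12
    total = 0.0
    kept = 0
    for s in step_scores:
        # w_n = 1 only when the thresholded step prediction agrees with the
        # final outcome; a score of exactly `threshold` counts as "incorrect"
        # here, which is an assumption.
        w = 1.0 if (s > threshold) == outcome_correct else 0.0
        s = min(max(s, eps), 1.0 - eps)  # clamp to keep log() finite
        bce = -(y * math.log(s) + (1.0 - y) * math.log(1.0 - s))
        total += w * bce
        kept += int(w)
    n = len(step_scores)
    return (total / n if n else 0.0), kept
```

For a correct trajectory with step scores [0.9, 0.2, 0.8], the step scored 0.2 disagrees with the outcome and is masked out, so only two of the three steps contribute. This makes concrete the reviewer's question about how many steps a hard cutoff discards early in training, when scores are still close to uninformative.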
Taxonomy
Topics: AI-based Problem Solving and Planning · Explainable Artificial Intelligence (XAI) · Business Process Modeling and Analysis
