Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning

Jingyu Xing; Chenwei Tang; Xinyu Liu; Deng Xiong; Shudong Huang; Wei Ju; Jiancheng Lv; Ziyue Qiao

arXiv:2510.17923·cs.LG·December 10, 2025

Open Access 4 Reviews

TL;DR

This paper introduces COMPASS, a novel self-scoring reward mechanism for test-time reinforcement learning that improves large language models' reasoning abilities without relying on labeled data.

Contribution

It proposes a new reward system combining answer confidence calibration and reasoning path decisiveness, enabling scalable learning from unlabeled data in LLMs.

Findings

01

Significant performance improvements across reasoning tasks.

02

Enhanced model stability and answer accuracy.

03

Effective learning without external supervision.

Abstract

Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches like Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without…
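The majority-voting consensus that the abstract attributes to Test-Time RL can be sketched as follows. This is a minimal illustration under stated assumptions, not the TTRL or COMPASS implementation: the function name `majority_vote_rewards`, the binary 0/1 reward, and the sample answers are all hypothetical choices made for the example.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Derive a pseudo-label by majority vote and score each sample against it.

    `answers` is a list of final answers extracted from sampled responses.
    The 0/1 reward scheme is an assumption made for this illustration.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # The most frequent answer becomes the pseudo-label (no ground truth is used).
    pseudo_label, _ = counts.most_common(1)[0]
    # Samples that agree with the pseudo-label are rewarded; the rest are not.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical sampled answers for a single unlabeled question.
samples = ["42", "42", "17", "42", "17"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)
```

If the most frequent answer happens to be wrong, every sample that agrees with it is still rewarded, which is the pseudo-label risk the abstract describes.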

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- COMPASS effectively enables self-improvement without labeled data, making reinforcement learning scalable to unlabeled reasoning tasks.
- The proposed DCAR and DPR components are well-motivated and complementary, jointly addressing reward reliability and reasoning quality.

Weaknesses

- The method still depends on self-consistency assumptions, which can reinforce systematic model biases or errors.
- The paper should include more challenging and diverse benchmarks to better demonstrate the generality of COMPASS.
- Error bars or statistical significance tests are missing, making it hard to assess the reliability of reported improvements.
- The performance gains over TTRL are relatively modest, raising questions about the practical impact of the proposed method.
- The paper prov

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The research on RL without explicit labels for reasoning is a promising direction.
2. This paper attempts to address the fragility of pseudo-labels and evaluates the process quality, which is a good idea.
3. The design of DPR is very interesting, but I believe it may lack some empirical evidence to support it.
4. The results outperform baseline TTRL.

Weaknesses

I find the claim/method of this paper is not very convincing, and the evaluation is limited. I discuss relevant weaknesses below.

### Method

The biggest problem is that the author makes many hypothetical claims without supporting empirical evidence or literature, which makes the argument unconvincing.

1. In line 218 and 241, the authors claim that "*We hypothesize that more confident responses should contribute more significantly to the final decision*" and "*Our underlying hypothesis is that a

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The work addresses the highly significant and challenging problem of LLM self-evolution using unlabeled data in a test-time setting.
2. The proposed method is a direct and effective improvement over the TTRL baseline, thoughtfully addressing its key weaknesses in reward sparsity and process-agnosticism.

Weaknesses

1. Fragility of the DPR Heuristic: The process reward (DPR) is shown to be brittle; it fails and degrades performance on the LLaMA3.2-1B model, as the paper admits its guiding heuristic (rewarding decisiveness in high-entropy states) reinforces "fundamental confusion" in less capable models.
2. Extensive Reward Engineering: The method's success appears to rest on a complex combination of specific, fine-tuned heuristics (e.g., the exact formulas for confidence and credibility), which are not we

Reviewer 04 · Rating 4 · Confidence 3

Strengths

1. The paper is composed in a clear and fluent style, making it easy to read and understand.
2. The proposed method is simple and clearly defined, facilitating comprehension.
3. The approach to designing rewards from both **outcome** and **process** perspectives shows a degree of innovation.
4. The experimental results indicate that the method achieves observable performance improvements compared to TTRL.

Weaknesses

1. The paper presents numerous **hypotheses** throughout, yet lacks experimental validation or theoretical derivation to substantiate their validity;
2. The connection between **Decisive-Path-Reward** and **process reward** appears somewhat tenuous; assigning rewards to tokens does not inherently qualify as a process reward mechanism;
3. The experimental scope is limited to models up to **7B parameters**, which undermines the persuasiveness of the findings; additionally, the authors could stren

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)