Tool Verification for Test-Time Reinforcement Learning

Ruotong Liao; Nikolai Röhrich; Xiaohan Wang; Yuhui Zhang; Yasaman Samadzadeh; Volker Tresp; Serena Yeung-Levy

arXiv:2603.02203·cs.AI·March 3, 2026

Open Access

TL;DR

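This paper introduces T^3RL, a method that adds tool-based verification to test-time reinforcement learning. Evidence from an external tool (e.g., code execution) is used to upweight verified rollouts during voting, which yields more reliable pseudo-labels, reduces the reinforcement of spurious consensus, and improves model adaptation on challenging math problems.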

Contribution

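The paper proposes a test-time tool verification mechanism for reward estimation in TTRL: a verifier checks rollouts against external tool evidence, and verified rollouts receive more weight in the vote that produces the pseudo-label, making the self-induced reward more reliable.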

Findings

01

T^3RL significantly improves over standard TTRL on MATH-500, AMC, and AIME 2024, across diverse backbone types.

02

Verification-aware voting improves pseudo-label quality.

03

Gains are larger on more difficult problems.

Abstract

Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder…
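The verification-aware voting described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the weight `verified_weight=3.0`, and the binary match-the-label reward are all assumptions made for illustration, and the `verified` flags stand in for whatever evidence the external tool (e.g., code execution) provides.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Plain majority voting: the most frequent answer becomes the pseudo-label."""
    return Counter(answers).most_common(1)[0][0]


def verification_aware_vote(answers, verified, verified_weight=3.0):
    """Weighted voting in which tool-verified rollouts count more.

    `verified_weight` is a hypothetical hyperparameter, not a value from the paper.
    Unverified rollouts contribute a weight of 1.0 each.
    """
    scores = defaultdict(float)
    for answer, is_verified in zip(answers, verified):
        scores[answer] += verified_weight if is_verified else 1.0
    return max(scores, key=scores.get)


def pseudo_rewards(answers, pseudo_label):
    """Assumed binary reward: 1.0 if a rollout's answer matches the pseudo-label."""
    return [1.0 if answer == pseudo_label else 0.0 for answer in answers]


# Five rollouts: three agree on an unverified answer, two on a verified one.
answers = ["12", "12", "12", "15", "15"]
verified = [False, False, False, True, True]

print(majority_vote(answers))                      # 12
label = verification_aware_vote(answers, verified)
print(label)                                       # 15
print(pseudo_rewards(answers, label))              # [0.0, 0.0, 0.0, 1.0, 1.0]
```

In this toy case plain majority voting would reward the unverified consensus "12", while the weighted vote selects the tool-verified answer "15", so only the verified rollouts are reinforced. How verification is actually performed and how strongly verified rollouts are weighted are details of the paper that this sketch does not attempt to reproduce.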

Taxonomy

Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Adversarial Robustness in Machine Learning