Tool Verification for Test-Time Reinforcement Learning
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy

TL;DR
This paper introduces T^3RL, a method that enhances test-time reinforcement learning by using external tools (e.g., code execution) to verify rollouts during reward estimation. Verified rollouts are upweighted in the vote that produces pseudo-labels, which reduces reinforcement of biased consensus and improves model adaptation on challenging math problems.
Contribution
The paper proposes a novel test-time tool verification mechanism for reinforcement learning, improving reward reliability and model performance on complex tasks.
Findings
T^3RL outperforms standard TTRL across multiple math benchmarks.
Verification-aware voting improves pseudo-label quality.
Larger gains are observed on more difficult problems.
Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting, producing more reliable pseudo-labels for training. Across various math difficulties (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder…
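The verification-aware voting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the binary `verified` flags, and the weight value `verified_weight=2.0` are all assumptions made for illustration, since the abstract only states that verified rollouts are upweighted.

```python
from collections import defaultdict


def verification_aware_vote(answers, verified, verified_weight=2.0):
    """Pick a pseudo-label by weighted voting over rollout answers.

    answers:  final answer extracted from each rollout (hypothetical format).
    verified: True where a tool check (e.g., code execution) agreed with the rollout.
    verified_weight: assumed vote weight for verified rollouts; unverified ones count 1.0.
    """
    if len(answers) != len(verified):
        raise ValueError("answers and verified must have the same length")
    scores = defaultdict(float)
    for answer, ok in zip(answers, verified):
        scores[answer] += verified_weight if ok else 1.0
    # Ties resolve to the answer that appeared first among the rollouts.
    return max(scores, key=scores.get)


def plain_majority_vote(answers):
    """Unweighted majority voting, as in the TTRL baseline described above."""
    return verification_aware_vote(answers, [False] * len(answers))


def pseudo_rewards(answers, label):
    """Reward 1.0 for rollouts matching the pseudo-label, else 0.0."""
    return [1.0 if a == label else 0.0 for a in answers]


# Three unverified rollouts agree on "12"; two tool-verified rollouts say "15".
answers = ["12", "12", "12", "15", "15"]
verified = [False, False, False, True, True]

majority_label = plain_majority_vote(answers)
weighted_label = verification_aware_vote(answers, verified)
rewards = pseudo_rewards(answers, weighted_label)
```

In this toy case plain majority voting would reinforce the unverified consensus "12", while the weighted vote selects "15" because the two verified rollouts carry a total weight of 4.0 against 3.0. How the actual method sets the weights and derives verification evidence is not specified in the text above.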
Taxonomy
Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Adversarial Robustness in Machine Learning
