VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers
Jianing Qi, Hao Tang, Zhigang Zhu

TL;DR
VerifierQ trains the LLM verifier with offline Q-learning instead of relying on supervised fine-tuning alone, with the aim of improving reasoning accuracy and verification efficiency at test time.
Contribution
It applies offline Q-learning, including Implicit Q-learning (IQL) and Conservative Q-learning (CQL), to LLM verifier models, and addresses three challenges that arise in that setting: utterance-level MDPs, large action spaces, and overestimation bias.
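The overestimation bias mentioned here is a general property of taking a maximum over noisy value estimates, and a small simulation can make it concrete. The sketch below is illustrative only and is not taken from the paper: it assumes every action has a true value of 0 and that each estimate carries independent Gaussian noise. The function name `mean_of_max` and all parameters are hypothetical.

```python
import random


def mean_of_max(num_actions: int, noise_std: float, trials: int, seed: int = 0) -> float:
    """Average of max(noisy estimates) when every true action value is 0.

    Assumption for illustration: estimates are true value (0) plus
    independent Gaussian noise. An unbiased maximum would average to 0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(estimates)
    return total / trials


if __name__ == "__main__":
    # The bias grows with the number of candidate actions.
    for n in (1, 10, 100):
        print(n, round(mean_of_max(n, noise_std=1.0, trials=5000), 3))
```

With a single action the average stays near zero, while with many actions the maximum is biased upward even though no action is actually better. This is why the size of the action space and overestimation are closely linked concerns.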
Findings
Outperforms supervised fine-tuning in mathematical reasoning tasks
Enhances efficiency and robustness of LLM verification
Enables parallel Q-value computation for faster training
Abstract
Recent advancements in test time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). This generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL). However, current verifier models in LLMs often rely on supervised fine-tuning without temporal difference learning such as Q-learning. This paper introduces VerifierQ, a novel approach that integrates Offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: (1) handling utterance-level Markov Decision Processes (MDPs), (2) managing large action spaces, and (3) mitigating overestimation bias. VerifierQ introduces a modified Bellman update for bounded Q-values, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a…
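As background for the IQL component named in the abstract, the following is a minimal sketch of the expectile regression loss from the general IQL literature. It is not VerifierQ's exact objective, which this summary does not give. The idea is to approximate a maximum over actions by fitting an upper expectile of sampled Q-values, so no explicit enumeration of the action space is needed. The helper names, the sample values, and the plain gradient-descent fit are assumptions chosen for illustration.

```python
def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2.

    diff is (target - prediction); tau in (0, 1) sets the asymmetry.
    tau = 0.5 recovers ordinary (half-weighted) squared error.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(samples: list[float], tau: float,
                  lr: float = 0.1, steps: int = 2000) -> float:
    """Fit a scalar value v minimising the mean expectile loss by gradient descent."""
    v = sum(samples) / len(samples)
    for _ in range(steps):
        grad = 0.0
        for q in samples:
            diff = q - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # derivative of the loss w.r.t. v
        v -= lr * grad / len(samples)
    return v


if __name__ == "__main__":
    # Hypothetical Q-values for several sampled next steps from one state.
    q_samples = [0.1, 0.2, 0.3, 0.9]
    print(fit_expectile(q_samples, tau=0.5))  # the sample mean
    print(fit_expectile(q_samples, tau=0.9))  # pulled toward the largest sample
```

Raising `tau` toward 1 moves the fitted value closer to the largest sampled Q-value, which is how expectile regression stands in for a max operator using only actions that appear in the dataset.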
Taxonomy
Topics: Natural Language Processing Techniques · Software Testing and Debugging Techniques · Topic Modeling
Methods: Q-Learning
