VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers

Jianing Qi; Hao Tang; Zhigang Zhu

arXiv:2410.08048 · cs.LG · October 11, 2024

Open Access

TL;DR

VerifierQ trains the verifier in a generator-verifier LLM setup with Offline Q-learning rather than supervised fine-tuning alone, with the aim of improving reasoning accuracy and test-time efficiency.

Contribution

It pioneers the application of Offline Q-learning, including Implicit and Conservative Q-learning, to LLM verifier models, addressing key challenges like large action spaces and overestimation bias.

Findings

01 Outperforms supervised fine-tuning in mathematical reasoning tasks

02 Enhances efficiency and robustness of LLM verification

03 Enables parallel Q-value computation for faster training

Abstract

Recent advancements in test time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). This generator-verifier approach closely resembles the actor-critic framework in reinforcement learning (RL). However, current verifier models in LLMs often rely on supervised fine-tuning without temporal difference learning such as Q-learning. This paper introduces VerifierQ, a novel approach that integrates Offline Q-learning into LLM verifier models. We address three key challenges in applying Q-learning to LLMs: (1) handling utterance-level Markov Decision Processes (MDPs), (2) managing large action spaces, and (3) mitigating overestimation bias. VerifierQ introduces a modified Bellman update for bounded Q-values, incorporates Implicit Q-learning (IQL) for efficient action space management, and integrates a…
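The abstract names Implicit Q-learning (IQL) as the component that handles the large action space. IQL is generally built on expectile regression, which estimates an upper expectile of the target values without enumerating actions. The sketch below illustrates that asymmetric loss only; it is not the paper's implementation. The function names, the scalar setting, and the toy targets in [0, 1] are assumptions made for illustration.

```python
def expectile_loss(value, target, tau=0.7):
    """Asymmetric squared error used in expectile regression.

    Targets above the current estimate are weighted by tau and
    targets below it by (1 - tau), so tau > 0.5 pulls the estimate
    toward the higher targets.
    """
    diff = target - value
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(targets, tau=0.7, lr=0.1, steps=2000):
    """Fit one scalar that minimises the mean expectile loss.

    Hypothetical helper: plain gradient descent on a single number,
    standing in for a learned value function.
    """
    v = sum(targets) / len(targets)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for t in targets:
            diff = t - v
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dv of weight * diff**2
        v -= lr * grad / len(targets)
    return v


# Toy targets, assumed to lie in [0, 1] like bounded Q-values.
toy_targets = [0.0, 0.2, 1.0]
mean_like = fit_expectile(toy_targets, tau=0.5)   # symmetric case
upper_like = fit_expectile(toy_targets, tau=0.9)  # leans toward the max
```

With `tau=0.5` the loss is symmetric and the fit recovers the mean of the targets. Raising `tau` moves the estimate toward the largest target, which is how expectile regression can approximate a maximum over actions using only sampled values.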

Taxonomy

Topics: Natural Language Processing Techniques · Software Testing and Debugging Techniques · Topic Modeling

Methods: Q-Learning