Prover-Verifier Games improve legibility of LLM outputs

Jan Hendrik Kirchner; Yining Chen; Harri Edwards; Jan Leike; Nat McAleese; Yuri Burda

arXiv:2407.13692·cs.CL·August 2, 2024·2 cites

TL;DR

This paper introduces a training method inspired by Prover-Verifier Games to improve the legibility and checkability of LLM outputs, enhancing human verification and model alignment.

Contribution

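The paper proposes a training algorithm that improves the legibility of LLM outputs by iteratively training small verifiers alongside "helpful" and "sneaky" provers. Over training, the verifier becomes more robust to adversarial solutions, and the helpful prover's solutions become easier for humans to verify.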

Findings

01. Helpful prover accuracy increases during training

02. Verifier robustness to adversarial attacks improves

03. Human verification accuracy improves with helpful prover solutions

Abstract

One way to increase confidence in the outputs of Large Language Models (LLMs) is to support them with reasoning that is clear and easy to check -- a property we call legibility. We study legibility in the context of solving grade-school math problems and show that optimizing chain-of-thought solutions only for answer correctness can make them less legible. To mitigate the loss in legibility, we propose a training algorithm inspired by Prover-Verifier Game from Anil et al. (2021). Our algorithm iteratively trains small verifiers to predict solution correctness, "helpful" provers to produce correct solutions that the verifier accepts, and "sneaky" provers to produce incorrect solutions that fool the verifier. We find that the helpful prover's accuracy and the verifier's robustness to adversarial attacks increase over the course of training. Furthermore, we show that legibility training…
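The alternation described in the abstract can be illustrated with a small sketch. The abstract does not state the reward functions or update rules, so everything below is an assumption made for illustration: `prover_reward`, `verifier_loss`, and `play_round` are hypothetical names, the fixed penalty of -1.0 is a placeholder, and the binary cross-entropy verifier loss is a swapped-in standard choice rather than the paper's objective. The sketch only collects the statistics that a verifier update and a prover update would consume; the updates themselves are omitted.

```python
import math
import random

ROLES = ("helpful", "sneaky")


def prover_reward(role, is_correct, verifier_score):
    """Toy role-conditioned reward (assumed, not the paper's formula).

    A helpful prover is paid the verifier's score for correct solutions;
    a sneaky prover is paid it for incorrect ones. Solutions that do not
    match the assigned role receive a fixed placeholder penalty.
    """
    role_aligned = is_correct if role == "helpful" else not is_correct
    return verifier_score if role_aligned else -1.0


def verifier_loss(verifier_score, is_correct):
    """Binary cross-entropy between the verifier score and correctness."""
    eps = 1e-12
    p = min(max(verifier_score, eps), 1.0 - eps)
    return -math.log(p) if is_correct else -math.log(1.0 - p)


def play_round(sample_solution, verifier, n_samples=100, seed=0):
    """Collect per-role prover rewards and the mean verifier loss.

    sample_solution(role, rng) -> (solution, is_correct) and
    verifier(solution) -> score in [0, 1] are caller-supplied stubs.
    """
    rng = random.Random(seed)
    rewards = {role: [] for role in ROLES}
    losses = []
    for _ in range(n_samples):
        role = rng.choice(ROLES)
        solution, is_correct = sample_solution(role, rng)
        score = verifier(solution)
        rewards[role].append(prover_reward(role, is_correct, score))
        losses.append(verifier_loss(score, is_correct))
    return rewards, sum(losses) / len(losses)


if __name__ == "__main__":
    # Stub prover: helpful solutions are usually correct, sneaky ones rarely.
    def stub_prover(role, rng):
        p_correct = 0.8 if role == "helpful" else 0.2
        return f"{role} solution", rng.random() < p_correct

    # Stub verifier: an untrained verifier that is maximally unsure.
    rewards, mean_loss = play_round(stub_prover, lambda solution: 0.5)
    print({role: len(r) for role, r in rewards.items()}, mean_loss)
```

In a full training loop, each round would first fit the verifier on the labeled solutions and then optimize the prover against the updated verifier's scores. Keeping the verifier small relative to the prover is the design choice the abstract emphasizes.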

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- Overall, this is a strong paper presenting alignment research in a novel, viable, and important direction. It focuses explicitly on training setups that are understudied and demonstrates strong results around legibility.
- The paper rests on a strong theoretical foundation around prover-verifier games, takes into account how an adversarial prover might work, and presents important early results about points such as verifier sizes, iterative setups when it comes to training models towards legib

Weaknesses

- The paper conducts all experiments exclusively on GSM8k, where explanations can indeed be step by step while being natural. Moreover, all experiments are done on a single model type (GPT-4). This raises some concerns about the generalizability of the prover-verifier setup, especially to domains such as code generation, writing, etc. and settings with different base models.
- The iterative training process might lead to overfitting and the early stopping conditions don’t seem clear and generali

Reviewer 02 · Rating 8 · Confidence 2

Strengths

I find the experiments very well-motivated. Prior papers in adjacent areas like debate focus on question-answering datasets; but it is obvious that research on legibility of reasoning is much more important. Although the GSM dataset is easy, this is the first paper in this direction, and honestly the experiments would have been a good contribution even if it was on a toy dataset. I also think human legibility studies are less likely to be misleading on this sort of dataset. The paper is extr

Weaknesses

**Models:** My assessment of the paper is based on the assumption that it does not matter for the purpose of this conference that the models here were never available to the public in any form. In the interest of taking everything in good faith, I see two acceptable reasons for this:
- there is no herd of similar models over a range of compute scales used in the paper; or
- human studies had to start before models of similar capabilities were available to the public;

This also assumes that the

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- The paper presents an innovative adaptation of the Prover-Verifier Game to train LLMs for legibility.
- It includes both theoretical proofs and empirical studies showing the benefits of their method in improving solution checkability.
- The study extends beyond automated verification to demonstrate human evaluators' performance, indicating real-world applicability.
- The authors acknowledge trade-offs between optimizing for accuracy and maintaining human-legible outputs, highlighting practical

Weaknesses

- The study primarily focuses on grade-school math problems; exploring broader applications could demonstrate the method's generalizability.
- The paper could benefit from more discussion on integrating this training into existing LLM frameworks and the computational resources required.

Code & Models

Repositories

codelion/optillm/blob/main/optillm/pvg.py
pytorch

Videos

Prover-Verifier Games Improve Legibility of LLM outputs · YouTube

Taxonomy

Topics: Logic, programming, and type systems · Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics